The project I am angling towards deals with my website's traffic and taxonomy data. I will be trying to build models that can accurately predict which tags are best for specific channels of traffic, and will also be investigating the longitudinal nature of an article's lifecycle (days to 90% of traffic is my current threshold for an article being done, but I will refine that with some descriptive statistics).

I'll also want to investigate what kind of content performs best in each month.



In [1]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

First, import the table of tag-article mappings from our SQL db



In [2]:

    
df = pd.read_csv('atlas-taggings.csv')



In [3]:

    
df.head(10)









    Out[3]:






  
    
      
      tag_id
      tag_url
      tagged_type
      tagged_id
      tagged_url
    
  
  
    
      0
      36
      www.atlasobscura.com/categories/abandoned
      Place
      9982
      www.atlasobscura.com/places/athens-olympic-spo...
    
    
      1
      2
      www.atlasobscura.com/categories/panoramas
      Place
      1676
      www.atlasobscura.com/places/velaslavasay-panorama
    
    
      2
      2
      www.atlasobscura.com/categories/panoramas
      Place
      6431
      www.atlasobscura.com/places/gettysburg-cyclorama
    
    
      3
      2
      www.atlasobscura.com/categories/panoramas
      Article
      2311
      www.atlasobscura.com/articles/rip-gettysburg-c...
    
    
      4
      258
      www.atlasobscura.com/categories/bridges
      Place
      10134
      www.atlasobscura.com/places/gimbel-s-bridge
    
    
      5
      2
      www.atlasobscura.com/categories/panoramas
      Place
      6430
      www.atlasobscura.com/places/borodino-panorama
    
    
      6
      2
      www.atlasobscura.com/categories/panoramas
      Place
      6428
      www.atlasobscura.com/places/panorama-mesdag
    
    
      7
      2
      www.atlasobscura.com/categories/panoramas
      Place
      3688
      www.atlasobscura.com/places/panorama-raclawice
    
    
      8
      3
      www.atlasobscura.com/categories/disasters
      Place
      6343
      www.atlasobscura.com/places/mars-bluff-crater
    
    
      9
      4
      www.atlasobscura.com/categories/atom-bombs
      Place
      6343
      www.atlasobscura.com/places/mars-bluff-crater



In [4]:

    
articles = df[df.tagged_type == 'Article']

We only care about the articles for this analysis. Place entries are outside scope.



In [5]:

    
articles.head()









    Out[5]:






  
    
      
      tag_id
      tag_url
      tagged_type
      tagged_id
      tagged_url
    
  
  
    
      3
      2
      www.atlasobscura.com/categories/panoramas
      Article
      2311
      www.atlasobscura.com/articles/rip-gettysburg-c...
    
    
      56
      27
      www.atlasobscura.com/categories/objects-of-int...
      Article
      2227
      www.atlasobscura.com/articles/objects-of-intri...
    
    
      57
      27
      www.atlasobscura.com/categories/objects-of-int...
      Article
      2268
      www.atlasobscura.com/articles/objects-of-intri...
    
    
      58
      27
      www.atlasobscura.com/categories/objects-of-int...
      Article
      2213
      www.atlasobscura.com/articles/objects-of-intri...
    
    
      62
      27
      www.atlasobscura.com/categories/objects-of-int...
      Article
      2216
      www.atlasobscura.com/articles/objects-of-intri...

Extract the tag name from the tag's URL



In [34]:

    
def get_tag(x):
    return x.split('/')[2]

#changing this function to get_tag_name() in module.









    



---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-34-e82a2969e1fb> in <module>()
      2 def get_tag(x):
      3     return x.split('/')[2]
----> 4 tag_mapping.tag_url = tag_mapping.tag_url.apply(get_tag)
      5 #changing this function to get_tag_name() in module.

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2235             values = lib.map_infer(values, boxer)
   2236 
-> 2237         mapped = lib.map_infer(values, f, convert=convert_dtype)
   2238         if len(mapped) and isinstance(mapped[0], Series):
   2239             from pandas.core.frame import DataFrame

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:63043)()

<ipython-input-34-e82a2969e1fb> in get_tag(x)
      1 tag_mapping.head()
      2 def get_tag(x):
----> 3     return x.split('/')[2]
      4 tag_mapping.tag_url = tag_mapping.tag_url.apply(get_tag)
      5 #changing this function to get_tag_name() in module.

IndexError: list index out of range

Create a tag_url column that just has the tag's name



In [10]:

    
articles.tag_url = articles.tag_url.apply(get_tag)
articles.head()









    Out[10]:






  
    
      
      tag_id
      tag_url
      tagged_type
      tagged_id
      tagged_url
    
  
  
    
      3
      2
      panoramas
      Article
      2311
      www.atlasobscura.com/articles/rip-gettysburg-c...
    
    
      56
      27
      objects-of-intrigue
      Article
      2227
      www.atlasobscura.com/articles/objects-of-intri...
    
    
      57
      27
      objects-of-intrigue
      Article
      2268
      www.atlasobscura.com/articles/objects-of-intri...
    
    
      58
      27
      objects-of-intrigue
      Article
      2213
      www.atlasobscura.com/articles/objects-of-intri...
    
    
      62
      27
      objects-of-intrigue
      Article
      2216
      www.atlasobscura.com/articles/objects-of-intri...

Get dummies for each tag



In [11]:

    
test = pd.get_dummies(articles.tag_url)



In [12]:

    
test.head()









    Out[12]:






  
    
      
      100-wonders
      19th-century
      2016-election
      30-rock
      31-days-of-halloween
      abandoned
      abandoned-amusement-parks
      abandoned-brooklyn
      abandoned-cemetaries
      abandoned-hospitals
      ...
      world-s-smallest
      world-s-tallest
      world-war-ii
      wunderkammer
      wwi
      wwii
      yehlui-geological-park
      yeti
      zombies
      zoos
    
  
  
    
      3
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      56
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      57
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      58
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      62
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
  

5 rows × 983 columns

Join the dummies back to the main dataframe



In [13]:

    
articles = articles.join(test)



In [14]:

    
articles.drop(['tag_id','tag_url','tagged_type','tagged_id'],axis=1,inplace=True)



In [15]:

    
articles.head()









    Out[15]:






  
    
      
      tagged_url
      100-wonders
      19th-century
      2016-election
      30-rock
      31-days-of-halloween
      abandoned
      abandoned-amusement-parks
      abandoned-brooklyn
      abandoned-cemetaries
      ...
      world-s-smallest
      world-s-tallest
      world-war-ii
      wunderkammer
      wwi
      wwii
      yehlui-geological-park
      yeti
      zombies
      zoos
    
  
  
    
      3
      www.atlasobscura.com/articles/rip-gettysburg-c...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      56
      www.atlasobscura.com/articles/objects-of-intri...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      57
      www.atlasobscura.com/articles/objects-of-intri...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      58
      www.atlasobscura.com/articles/objects-of-intri...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
    
      62
      www.atlasobscura.com/articles/objects-of-intri...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    
  

5 rows × 984 columns

De-dupe articles but maintain the tagging data using groupby and sum



In [16]:

    
unique_articles = articles.groupby('tagged_url').sum() #made into func



In [17]:

    
unique_articles = unique_articles.reset_index()



In [18]:

    
unique_articles = unique_articles.set_index('tagged_url')

Using a csv generated by a script I wrote that queries Google Analytics for pageviews per article from publish date to n-days post-publication, import pageview data and join it to the tag/article DataFrame



In [19]:

    
#now we need the pageviews and have to map the URLs to Page Titles
pageviews = pd.read_csv('output_articles_performance.csv',header=None,names=['url','published','pageviews'])
pageviews.head()
#In the future I should import the module and run it here instead of grabbing.









    Out[19]:






  
    
      
      url
      published
      pageviews
    
  
  
    
      0
      jamaica-may-get-rid-of-queen-elizabeth-and-fin...
      2016-04-15
      3997
    
    
      1
      trippy-blacklight-posters-from-the-psychedelic...
      2016-04-15
      7042
    
    
      2
      leonardo-da-vincis-living-descendants-have-bee...
      2016-04-15
      12448
    
    
      3
      catapult-into-the-weekend-like-this-gopro-off-...
      2016-04-15
      4187
    
    
      4
      cat-rescued-after-4-days-stuck-on-insanely-tal...
      2016-04-15
      2721



In [20]:

    
pageviews.url = ['www.atlasobscura.com/articles/' + x for x in pageviews.url]



In [21]:

    
pageviews.head()









    Out[21]:






  
    
      
      url
      published
      pageviews
    
  
  
    
      0
      www.atlasobscura.com/articles/jamaica-may-get-...
      2016-04-15
      3997
    
    
      1
      www.atlasobscura.com/articles/trippy-blackligh...
      2016-04-15
      7042
    
    
      2
      www.atlasobscura.com/articles/leonardo-da-vinc...
      2016-04-15
      12448
    
    
      3
      www.atlasobscura.com/articles/catapult-into-th...
      2016-04-15
      4187
    
    
      4
      www.atlasobscura.com/articles/cat-rescued-afte...
      2016-04-15
      2721



In [22]:

    
pageviews.describe()









    Out[22]:






  
    
      
      pageviews
    
  
  
    
      count
      3446.000000
    
    
      mean
      7052.891759
    
    
      std
      23256.215270
    
    
      min
      1.000000
    
    
      25%
      1150.250000
    
    
      50%
      2571.500000
    
    
      75%
      5834.750000
    
    
      max
      621494.000000

Set the pageviews index to the url column to make joining easy



In [23]:

    
pageviews.set_index('url',inplace=True)



In [24]:

    
article_set = unique_articles.join(pageviews)



In [25]:

    
article_set.head()









    Out[25]:






  
    
      
      100-wonders
      19th-century
      2016-election
      30-rock
      31-days-of-halloween
      abandoned
      abandoned-amusement-parks
      abandoned-brooklyn
      abandoned-cemetaries
      abandoned-hospitals
      ...
      world-war-ii
      wunderkammer
      wwi
      wwii
      yehlui-geological-park
      yeti
      zombies
      zoos
      published
      pageviews
    
    
      tagged_url
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-01
      651.0
    
    
      www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-06-09
      3505.0
    
    
      www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-05-12
      840.0
    
    
      www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-30
      4037.0
    
    
      www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-01-07
      1620.0
    
  

5 rows × 985 columns

Reset index



In [26]:

    
article_set.reset_index()









    Out[26]:






  
    
      
      tagged_url
      100-wonders
      19th-century
      2016-election
      30-rock
      31-days-of-halloween
      abandoned
      abandoned-amusement-parks
      abandoned-brooklyn
      abandoned-cemetaries
      ...
      world-war-ii
      wunderkammer
      wwi
      wwii
      yehlui-geological-park
      yeti
      zombies
      zoos
      published
      pageviews
    
  
  
    
      0
      www.atlasobscura.com/articles/10-little-known-...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-01
      651.0
    
    
      1
      www.atlasobscura.com/articles/10-of-the-greate...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-06-09
      3505.0
    
    
      2
      www.atlasobscura.com/articles/10-places-12-yea...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-05-12
      840.0
    
    
      3
      www.atlasobscura.com/articles/10-things-that-y...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-30
      4037.0
    
    
      4
      www.atlasobscura.com/articles/100-wonders-a-vi...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-01-07
      1620.0
    
    
      5
      www.atlasobscura.com/articles/100-wonders-an-i...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-20
      4049.0
    
    
      6
      www.atlasobscura.com/articles/100-wonders-batt...
      1.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-17
      2727.0
    
    
      7
      www.atlasobscura.com/articles/100-wonders-bloo...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-15
      1290.0
    
    
      8
      www.atlasobscura.com/articles/100-wonders-clow...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-21
      1450.0
    
    
      9
      www.atlasobscura.com/articles/100-wonders-dese...
      1.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-03
      2635.0
    
    
      10
      www.atlasobscura.com/articles/100-wonders-devi...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-19
      2886.0
    
    
      11
      www.atlasobscura.com/articles/100-wonders-edis...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-12
      1600.0
    
    
      12
      www.atlasobscura.com/articles/100-wonders-its-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-28
      792.0
    
    
      13
      www.atlasobscura.com/articles/100-wonders-last...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-10
      3267.0
    
    
      14
      www.atlasobscura.com/articles/100-wonders-mode...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-30
      6290.0
    
    
      15
      www.atlasobscura.com/articles/100-wonders-necr...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-15
      8366.0
    
    
      16
      www.atlasobscura.com/articles/100-wonders-new-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-06-18
      10149.0
    
    
      17
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-23
      1625.0
    
    
      18
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-02-04
      998.0
    
    
      19
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-17
      3102.0
    
    
      20
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-26
      2396.0
    
    
      21
      www.atlasobscura.com/articles/100-wonders-the-...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-24
      2490.0
    
    
      22
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-06-25
      1294.0
    
    
      23
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-09
      2509.0
    
    
      24
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-08
      4845.0
    
    
      25
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-22
      2366.0
    
    
      26
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-10
      5815.0
    
    
      27
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-14
      1654.0
    
    
      28
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-18
      3708.0
    
    
      29
      www.atlasobscura.com/articles/100-wonders-the-...
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-27
      1841.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      2783
      www.atlasobscura.com/articles/williamsburg-sav...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-02-12
      1101.0
    
    
      2784
      www.atlasobscura.com/articles/winters-effigies...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-12-17
      16565.0
    
    
      2785
      www.atlasobscura.com/articles/wishing-trees
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-07-30
      5333.0
    
    
      2786
      www.atlasobscura.com/articles/without-people-p...
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-02-17
      3408.0
    
    
      2787
      www.atlasobscura.com/articles/wolhusen-mortuar...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-01-19
      2469.0
    
    
      2788
      www.atlasobscura.com/articles/wonderland-lost-...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2013-05-16
      7380.0
    
    
      2789
      www.atlasobscura.com/articles/wonders-of-polar...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-01-28
      1883.0
    
    
      2790
      www.atlasobscura.com/articles/woody-guthries-w...
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-04-03
      3541.0
    
    
      2791
      www.atlasobscura.com/articles/woolly-mammoth-o...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2013-07-16
      1740.0
    
    
      2792
      www.atlasobscura.com/articles/working-at-a-coo...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-29
      6031.0
    
    
      2793
      www.atlasobscura.com/articles/world-record-fil...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-02
      3258.0
    
    
      2794
      www.atlasobscura.com/articles/world-s-largest-...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-12-04
      93.0
    
    
      2795
      www.atlasobscura.com/articles/world-s-oldest-b...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-10-15
      4814.0
    
    
      2796
      www.atlasobscura.com/articles/world-wingsuit-l...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-14
      1135.0
    
    
      2797
      www.atlasobscura.com/articles/worlds-fair-reli...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-01-07
      2233.0
    
    
      2798
      www.atlasobscura.com/articles/worldwide-scotch...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-06
      23143.0
    
    
      2799
      www.atlasobscura.com/articles/wrapping-armchai...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-02-03
      1409.0
    
    
      2800
      www.atlasobscura.com/articles/written-in-the-s...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-05-11
      2892.0
    
    
      2801
      www.atlasobscura.com/articles/wwii-to-syria-ho...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-12
      5503.0
    
    
      2802
      www.atlasobscura.com/articles/xylothek
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-05-22
      8876.0
    
    
      2803
      www.atlasobscura.com/articles/yarn-stores-cand...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-30
      11984.0
    
    
      2804
      www.atlasobscura.com/articles/you-can-now-take...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-15
      3507.0
    
    
      2805
      www.atlasobscura.com/articles/you-still-have-t...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-14
      13.0
    
    
      2806
      www.atlasobscura.com/articles/your-new-favorit...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-13
      2256.0
    
    
      2807
      www.atlasobscura.com/articles/your-ticket-to-t...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-13
      690.0
    
    
      2808
      www.atlasobscura.com/articles/youre-not-a-true...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-18
      823.0
    
    
      2809
      www.atlasobscura.com/articles/youve-visited-10...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-16
      1660.0
    
    
      2810
      www.atlasobscura.com/articles/zeroes-after-zer...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-01-07
      11902.0
    
    
      2811
      www.atlasobscura.com/articles/zombie-mines-hau...
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-05-16
      1676.0
    
    
      2812
      www.atlasobscura.com/articles/zzyzx-california...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-24
      43015.0
    
  

2813 rows × 986 columns



In [27]:

    
article_set['upper_quartile'] = [1 if x > 10000 else 0 for x in article_set.pageviews]



In [28]:

    
article_set.pageviews.plot(kind='hist', bins=100,title='Page View Distribution, All Content')









    Out[28]:





<matplotlib.axes._subplots.AxesSubplot at 0x1060f1510>



In [29]:

    
article_set['published'] = pd.to_datetime(article_set['published'])



In [30]:

    
article_set









    Out[30]:






  
    
      
      100-wonders
      19th-century
      2016-election
      30-rock
      31-days-of-halloween
      abandoned
      abandoned-amusement-parks
      abandoned-brooklyn
      abandoned-cemetaries
      abandoned-hospitals
      ...
      wunderkammer
      wwi
      wwii
      yehlui-geological-park
      yeti
      zombies
      zoos
      published
      pageviews
      upper_quartile
    
    
      tagged_url
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-01
      651.0
      0
    
    
      www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-06-09
      3505.0
      0
    
    
      www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-05-12
      840.0
      0
    
    
      www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-30
      4037.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-01-07
      1620.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-an-island-you-dont-want-to-visit
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-20
      4049.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-battleship-island
      1.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-17
      2727.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-blood-falls
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-15
      1290.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-clown-motel
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-21
      1450.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-desertron
      1.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-03
      2635.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-devils-kettle
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-19
      2886.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-edisons-last-breath
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-12
      1600.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-its-taco-time
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-28
      792.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-last-tree-of-tenere
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-10
      3267.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-model-behavior
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-30
      6290.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-necropants
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-15
      8366.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-new-york-s-triangle-of-shame
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-06-18
      10149.0
      1
    
    
      www.atlasobscura.com/articles/100-wonders-the-arrow-stork
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-23
      1625.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-atomic-clock
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-02-04
      998.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-blue-lagoon-of-buxton
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-17
      3102.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-bone-church
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-26
      2396.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-cave-of-crystals
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-24
      2490.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-classiest-saint-relic-in-europe
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-06-25
      1294.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-color-of-control
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-09
      2509.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-dyatlov-incident
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-08
      4845.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-everlasting-lightning-storm
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-22
      2366.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-gates-of-hell
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-10
      5815.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-glowing-ocean
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-14
      1654.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-great-boston-molasses-flood
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-18
      3708.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-the-great-green-wall-of-africa
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-27
      1841.0
      0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      www.atlasobscura.com/articles/williamsburg-savings-bank-restoration
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-02-12
      1101.0
      0
    
    
      www.atlasobscura.com/articles/winters-effigies-the-deviant-history-of-the-snowman
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-12-17
      16565.0
      1
    
    
      www.atlasobscura.com/articles/wishing-trees
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-07-30
      5333.0
      0
    
    
      www.atlasobscura.com/articles/without-people-project
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-02-17
      3408.0
      0
    
    
      www.atlasobscura.com/articles/wolhusen-mortuary-chapel-where-real-skulls-join-a-dance-of-death
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-01-19
      2469.0
      0
    
    
      www.atlasobscura.com/articles/wonderland-lost-the-abandoned-beijing-amusement-park-is-razed
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2013-05-16
      7380.0
      0
    
    
      www.atlasobscura.com/articles/wonders-of-polar-architecture
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-01-28
      1883.0
      0
    
    
      www.atlasobscura.com/articles/woody-guthries-wardy-forty
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      1.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-04-03
      3541.0
      0
    
    
      www.atlasobscura.com/articles/woolly-mammoth-on-display-in-japan
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2013-07-16
      1740.0
      0
    
    
      www.atlasobscura.com/articles/working-at-a-cookie-factory-ruined-cookies-for-me-forever
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-29
      6031.0
      0
    
    
      www.atlasobscura.com/articles/world-record-filibuster-ends-after-192-hours-of-orwell-and-internet-comments
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-02
      3258.0
      0
    
    
      www.atlasobscura.com/articles/world-s-largest-manta-ray-trafficker-bust-in-indonesia
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-12-04
      93.0
      0
    
    
      www.atlasobscura.com/articles/world-s-oldest-botanical-gardens
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-10-15
      4814.0
      0
    
    
      www.atlasobscura.com/articles/world-wingsuit-league-china-grand-prix-2015
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-14
      1135.0
      0
    
    
      www.atlasobscura.com/articles/worlds-fair-relics-paris
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-01-07
      2233.0
      0
    
    
      www.atlasobscura.com/articles/worldwide-scotch-shortage-compounds-existing-bourbon-scarcity
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-06
      23143.0
      1
    
    
      www.atlasobscura.com/articles/wrapping-armchairs-in-wire-and-other-childhood-attempts-to-travel-in-time
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-02-03
      1409.0
      0
    
    
      www.atlasobscura.com/articles/written-in-the-skin-3-places-to-find-books-bound-in-skin
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-05-11
      2892.0
      0
    
    
      www.atlasobscura.com/articles/wwii-to-syria-how-seed-vaults-weather-wars
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-10-12
      5503.0
      0
    
    
      www.atlasobscura.com/articles/xylothek
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-05-22
      8876.0
      0
    
    
      www.atlasobscura.com/articles/yarn-stores-candy-shops-funeral-homes-and-more-of-the-uncategorizable-punny-businesses
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-30
      11984.0
      1
    
    
      www.atlasobscura.com/articles/you-can-now-take-your-pot-to-the-skies-in-oregon
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-15
      3507.0
      0
    
    
      www.atlasobscura.com/articles/you-still-have-time-to-apply-to-be-a-fulltime-ninja-in-japan
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-03-14
      13.0
      0
    
    
      www.atlasobscura.com/articles/your-new-favorite-honey-is-made-out-of-bug-poop-and-bee-vomit
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-13
      2256.0
      0
    
    
      www.atlasobscura.com/articles/your-ticket-to-the-1893-columbian-exposition
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-07-13
      690.0
      0
    
    
      www.atlasobscura.com/articles/youre-not-a-true-australian-until-youve-been-divebombed-by-a-magpie
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-18
      823.0
      0
    
    
      www.atlasobscura.com/articles/youve-visited-100-countries-join-the-club
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-11-16
      1660.0
      0
    
    
      www.atlasobscura.com/articles/zeroes-after-zeroes-the-worlds-highest-currencies
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-01-07
      11902.0
      1
    
    
      www.atlasobscura.com/articles/zombie-mines-haunt-the-landscape
      0.0
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-05-16
      1676.0
      0
    
    
      www.atlasobscura.com/articles/zzyzx-california-or-the-biggest-health-spa-scam-in-american-history
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-09-24
      43015.0
      1
    
  

2813 rows × 986 columns



In [31]:

    
article_set['year'] = pd.DatetimeIndex(article_set['published']).year



In [32]:

    
ax = article_set.boxplot(column='pageviews',by='year',figsize=(6,6),showfliers=False)
ax.set(title='PV distribution by year',ylabel='pageviews')









    Out[32]:





[<matplotlib.text.Text at 0x1181edb50>, <matplotlib.text.Text at 0x1181f9150>]

Articles published more recently have, on average, received much more traffic than older articles (this reflects growth and heavier distribution of the newer content). The drop in the mean as we move into 2016 is an artifact of the article's lifecycle not being complete.

Article lifecycle will be explored below.



In [33]:

    
yearly = article_set.set_index('published').resample('M').mean().plot(y='pageviews')
yearly.set(title='Total Pageviews By Month of Article Publication')









    Out[33]:





[<matplotlib.text.Text at 0x118f3e690>]

Let's import the time-series I created with a python script that asks GA for the daily time-series of Pageviewsof each article from publication date forward two years.



In [35]:

    
time_series = pd.read_csv('time-series.csv')



In [36]:

    
type(time_series)









    Out[36]:





pandas.core.frame.DataFrame



In [37]:

    
time_series = time_series.drop('Unnamed: 0',axis=1)

It was easier to collect the data from GA by looping over the columns in my original dataframe, but having each row be an article record is easier to work with now, so we transpose.



In [38]:

    
time_series = time_series.T



In [39]:

    
time_series.columns









    Out[39]:





RangeIndex(start=0, stop=731, step=1)



In [40]:

    
time_series['total'] = time_series.sum(axis=1)



In [41]:

    
time_series.head()









    Out[41]:






  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
      ...
      722
      723
      724
      725
      726
      727
      728
      729
      730
      total
    
  
  
    
      10-little-known-beaches-to-explore-in-the-last-days-of-summer
      2.0
      419.0
      203.0
      19.0
      4.0
      6.0
      2.0
      7.0
      4.0
      35.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      940.0
    
    
      10-of-the-greatest-overland-migrations-photos
      468.0
      368.0
      658.0
      325.0
      138.0
      40.0
      33.0
      77.0
      63.0
      21.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      4826.0
    
    
      10-places-12-year-old-me-would-love-to-live
      106.0
      762.0
      271.0
      132.0
      209.0
      96.0
      41.0
      15.0
      9.0
      9.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      4621.0
    
    
      10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton
      2186.0
      538.0
      209.0
      377.0
      92.0
      134.0
      80.0
      34.0
      18.0
      75.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      4482.0
    
    
      100-wonders-a-visit-with-a-frozen-dead-guy
      928.0
      272.0
      231.0
      87.0
      96.0
      40.0
      16.0
      11.0
      7.0
      7.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      2032.0
    
  

5 rows × 732 columns

Let's determine how many days post-publication it takes for an article to collect 90% of total pageviews.



In [42]:

    
time_series['days_to_90p']= [(time_series.iloc[x].expanding().sum() > time_series.iloc[x].total*.90).argmax() \
                                 for x in range(len(time_series))]



In [43]:

    
time_series.reset_index(inplace=True)



In [44]:

    
time_series.head(1)









    Out[44]:






  
    
      
      index
      0
      1
      2
      3
      4
      5
      6
      7
      8
      ...
      723
      724
      725
      726
      727
      728
      729
      730
      total
      days_to_90p
    
  
  
    
      0
      10-little-known-beaches-to-explore-in-the-last...
      2.0
      419.0
      203.0
      19.0
      4.0
      6.0
      2.0
      7.0
      4.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      940.0
      189
    
  

1 rows × 734 columns



In [45]:

    
time_series['index'] = ['www.atlasobscura.com/articles/' + x for x in time_series['index']]
time_series.set_index('index',inplace=True)
time_series = time_series.join(pageviews.published)
time_series.head(5)









    Out[45]:






  
    
      
      0
      1
      2
      3
      4
      5
      6
      7
      8
      9
      ...
      724
      725
      726
      727
      728
      729
      730
      total
      days_to_90p
      published
    
    
      index
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer
      2.0
      419.0
      203.0
      19.0
      4.0
      6.0
      2.0
      7.0
      4.0
      35.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      940.0
      189
      2015-08-01
    
    
      www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos
      468.0
      368.0
      658.0
      325.0
      138.0
      40.0
      33.0
      77.0
      63.0
      21.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      4826.0
      230
      2015-06-09
    
    
      www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live
      106.0
      762.0
      271.0
      132.0
      209.0
      96.0
      41.0
      15.0
      9.0
      9.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      4621.0
      634
      2014-05-12
    
    
      www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton
      2186.0
      538.0
      209.0
      377.0
      92.0
      134.0
      80.0
      34.0
      18.0
      75.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      4482.0
      19
      2015-12-30
    
    
      www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy
      928.0
      272.0
      231.0
      87.0
      96.0
      40.0
      16.0
      11.0
      7.0
      7.0
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      2032.0
      45
      2016-01-07
    
  

5 rows × 734 columns



In [46]:

    
time_series['published'] = pd.to_datetime(time_series.published)



In [47]:

    
time_series['year_pub'] = pd.DatetimeIndex(time_series['published']).year



In [48]:

    
time_series.boxplot(column='days_to_90p',by='year_pub')









    Out[48]:





<matplotlib.axes._subplots.AxesSubplot at 0x11d1c8250>



In [49]:

    
time_series.year_pub.value_counts(dropna=False)









    Out[49]:





 2015.0    1346
 2016.0     775
 2014.0     476
 2013.0     447
 2012.0      30
NaN          11
 2010.0       3
 2011.0       2
Name: year_pub, dtype: int64



In [50]:

    
time_series[['days_to_90p','total','year_pub']].corr()









    Out[50]:






  
    
      
      days_to_90p
      total
      year_pub
    
  
  
    
      days_to_90p
      1.000000
      -0.058821
      -0.742601
    
    
      total
      -0.058821
      1.000000
      0.092965
    
    
      year_pub
      -0.742601
      0.092965
      1.000000



In [403]:

    
#I DON'T KNOW WHY THIS WON'T WORK
time_series['30-day-PVs'] = [time_series.fillna(value=0).iloc[x,0:31].sum() for x in range(len(time_series))]



In [417]:

    
time_series['7-day-PVs'] = [time_series.fillna(value=0).iloc[x,0:8].sum() for x in range(len(time_series))]

Now let's look at the number of articles per tag (we will later join the two DataFrames above into one)



In [92]:

    
total_tagged= pd.DataFrame(data=article_set.sum(),columns = ['num_tagged'])



In [93]:

    
total_tagged.sort_values('num_tagged',ascending=False,inplace=True)



In [94]:

    
total_tagged.drop('pageviews',axis=0,inplace=True)



In [95]:

    
total_tagged[total_tagged.num_tagged >= 10].count()









    Out[95]:





num_tagged    199
dtype: int64



In [96]:

    
total_tagged[total_tagged.num_tagged <=5].index









    Out[96]:





Index([u'india', u'funeral-art', u'banks', u'bioluminescence', u'bars',
       u'assassination', u'utopias', u'flora', u'turkey', u'bicycles',
       ...
       u'earthquakes', u'pink', u'pigs', u'physics', u'edmund-hillary',
       u'education', u'philip-k-dick', u'pharmacy-museums', u'egypt',
       u'cybersecurity'],
      dtype='object', length=679)



In [124]:

    
#tag_analysis = article_set.drop(total_tagged[total_tagged.num_tagged < 5].index,axis=1)
#I'm resetting tag_analysis to contain all tags so I can manipulate later whenever I want. It makes it more clear.
tag_analysis = article_set



In [98]:

    
print tag_analysis.shape
tag_analysis.head()









    



(2813, 354)






    Out[98]:






  
    
      
      100-wonders
      31-days-of-halloween
      abandoned
      abandoned-amusement-parks
      abandoned-hospitals
      abandoned-insane-asylums
      abe-day
      aircraft
      airplanes
      airports
      ...
      world-s-fair
      world-s-oldest
      wunderkammer
      wwi
      wwii
      zombies
      published
      pageviews
      upper_quartile
      year
    
    
      tagged_url
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-08-01
      651.0
      0
      2015.0
    
    
      www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-06-09
      3505.0
      0
      2015.0
    
    
      www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-05-12
      840.0
      0
      2014.0
    
    
      www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-30
      4037.0
      0
      2015.0
    
    
      www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2016-01-07
      1620.0
      0
      2016.0
    
  

5 rows × 354 columns



In [60]:

    
tag_analysis.tail()
tag_analysis.to_csv('tag_analysis_ready.csv')



In [99]:

    
total_tagged.head(30)
print total_tagged.shape



In [100]:

    
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(interaction_only=True)



In [101]:

    
poly_df = pd.DataFrame(poly.fit_transform(tag_analysis.fillna(0).drop(['published','pageviews','upper_quartile','year'],axis=1)))



In [102]:

    
poly.n_output_features_









    Out[102]:





61426



In [103]:

    
total_tagged.ix['extra-mile']









    Out[103]:





num_tagged    16.0
Name: extra-mile, dtype: float64



In [104]:

    
regular_features = ['places-you-can-no-longer-go','100-wonders','extra-mile','video-wonders','news','features','columns',
                    'found','animals','fleeting-wonders','visual','other-capitals-of-the-world','video','art','list','objects-of-intrigue',
                    'maps','morbid-monday','female-explorers','naturecultures']



In [125]:

    
total_tagged[total_tagged.num_tagged >10].shape









    Out[125]:





(185, 1)



In [304]:

    
interactions = pd.DataFrame()



In [305]:

    
for item in regular_features:
    for column in tag_analysis.drop(['published','pageviews','upper_quartile','year'],axis=1).drop(
         total_tagged[total_tagged.num_tagged < 10].index,axis=1).columns:
        interactions[(item + '_' + column)] = tag_analysis[item] + tag_analysis[column]
#Just sum the row and column and then turn any 2s into 1s and 1s into zeros.



In [306]:

    
def correct_values(x):
    if x == 2.0:
        return 1
    elif x == 1.0:
        return 0
    else:
        return 0
for item in interactions.columns:
    interactions[item] = interactions[item].apply(correct_values)



In [307]:

    
interactions.head(2)









    Out[307]:






  
    
      
      places-you-can-no-longer-go_100-wonders
      places-you-can-no-longer-go_31-days-of-halloween
      places-you-can-no-longer-go_abandoned
      places-you-can-no-longer-go_abandoned-insane-asylums
      places-you-can-no-longer-go_aircraft
      places-you-can-no-longer-go_airplanes
      places-you-can-no-longer-go_amusement-parks
      places-you-can-no-longer-go_ancient
      places-you-can-no-longer-go_animal-week
      places-you-can-no-longer-go_animals
      ...
      naturecultures_volcanoes
      naturecultures_war
      naturecultures_water
      naturecultures_watery-wonders
      naturecultures_weird-weather-phenomena
      naturecultures_whales
      naturecultures_witchcraft
      naturecultures_women
      naturecultures_world-s-fair
      naturecultures_wwii
    
    
      tagged_url
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
  

2 rows × 3940 columns



In [308]:

    
tagged_total = pd.DataFrame(data =interactions.sum(),columns=['num_tagged'])
tagged_total = tagged_total.sort_values('num_tagged',ascending=False)



In [309]:

    
identity_tags = tagged_total[0:26].index



In [310]:

    
interactions = interactions.drop(identity_tags,axis=1)



In [311]:

    
tagged_total = pd.DataFrame(data =interactions.sum(),columns=['num_tagged'])
tagged_total = tagged_total.sort_values('num_tagged',ascending=False)
tagged_total.head(10)









    Out[311]:






  
    
      
      num_tagged
    
  
  
    
      news_space
      43
    
    
      animals_features
      38
    
    
      news_animals
      38
    
    
      animals_news
      38
    
    
      features_animals
      38
    
    
      100-wonders_video
      37
    
    
      video_100-wonders
      37
    
    
      columns_features
      35
    
    
      features_columns
      35
    
    
      columns_map-monday
      33



In [312]:

    
#DO I WANT TO DROP THE EMPTY COLUMNS?
#for item in interactions.columns:
 #   if interactions[item].sum == 0:
#      interactions = interactions.drop(item,axis=1)



In [313]:

    
interactions.head(10)









    Out[313]:






  
    
      
      places-you-can-no-longer-go_100-wonders
      places-you-can-no-longer-go_31-days-of-halloween
      places-you-can-no-longer-go_abandoned
      places-you-can-no-longer-go_abandoned-insane-asylums
      places-you-can-no-longer-go_aircraft
      places-you-can-no-longer-go_airplanes
      places-you-can-no-longer-go_amusement-parks
      places-you-can-no-longer-go_ancient
      places-you-can-no-longer-go_animal-week
      places-you-can-no-longer-go_animals
      ...
      naturecultures_volcanoes
      naturecultures_war
      naturecultures_water
      naturecultures_watery-wonders
      naturecultures_weird-weather-phenomena
      naturecultures_whales
      naturecultures_witchcraft
      naturecultures_women
      naturecultures_world-s-fair
      naturecultures_wwii
    
    
      tagged_url
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-an-island-you-dont-want-to-visit
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-battleship-island
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-blood-falls
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-clown-motel
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-desertron
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
  

10 rows × 3914 columns



In [314]:

    
interactions = interactions.join(pageviews)



In [315]:

    
#drop empty cols
def drop_zero_cols(df):
    for item in df.columns:
        if df[item].sum() == 0:
            df = df.drop(item,axis=1)
        else:
            continue
    return df



In [316]:

    
interactions = drop_zero_cols(interactions.fillna(0).drop(['published','pageviews'],axis=1))
interactions = interactions.join(pageviews)



In [317]:

    
interactions.head(1)









    Out[317]:






  
    
      
      places-you-can-no-longer-go_castles
      places-you-can-no-longer-go_cemeteries
      places-you-can-no-longer-go_cheat-week
      places-you-can-no-longer-go_escape-week
      places-you-can-no-longer-go_film
      places-you-can-no-longer-go_garbage
      places-you-can-no-longer-go_garbage-week
      places-you-can-no-longer-go_islands
      places-you-can-no-longer-go_japan
      places-you-can-no-longer-go_nazis
      ...
      naturecultures_science
      naturecultures_sounds
      naturecultures_space
      naturecultures_time-week
      naturecultures_transportation
      naturecultures_trees
      naturecultures_underground-week
      naturecultures_wwii
      published
      pageviews
    
    
      tagged_url
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      2015-08-01
      651.0
    
  

1 rows × 1236 columns



In [318]:

    
interaction_totals = pd.DataFrame(interactions.sum().sort_values(ascending=False),columns=['num_tagged'])



In [345]:

    
interaction_totals[interaction_totals.num_tagged < 4].shape









    Out[345]:





(1008, 1)



In [346]:

    
interactions_analysis = interactions.drop(interaction_totals[interaction_totals.num_tagged < 4].index,axis=1)



In [347]:

    
interactions_analysis.head()









    Out[347]:






  
    
      
      100-wonders_disaster-areas
      100-wonders_disasters
      100-wonders_science
      100-wonders_video
      extra-mile_columns
      extra-mile_extra-mile
      video-wonders_animals
      video-wonders_australia
      video-wonders_sports
      news_airplanes
      ...
      morbid-monday_relics
      female-explorers_columns
      female-explorers_female-explorers
      female-explorers_kickass-women
      naturecultures_animals
      naturecultures_columns
      naturecultures_features
      naturecultures_naturecultures
      published
      pageviews
    
    
      tagged_url
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      2015-08-01
      651.0
    
    
      www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      2015-06-09
      3505.0
    
    
      www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      2014-05-12
      840.0
    
    
      www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      2015-12-30
      4037.0
    
    
      www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy
      0
      0
      0
      1
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      2016-01-07
      1620.0
    
  

5 rows × 228 columns



In [348]:

    
#Check whether number of Aggregated stories published per day has an impact on average/total Day 0 - 1 traffic.



In [349]:

    
from sklearn import linear_model
from sklearn import metrics
from sklearn import cross_validation



In [350]:

    
interactions_analysis['upper_quartile'] = [1 if x > 10000 else 0 for x in interactions.pageviews]



In [351]:

    
interactions_analysis['twenty_thousand'] = [1 if x > 20000 else 0 for x in interactions.pageviews]



In [352]:

    
y = interactions_analysis.upper_quartile
X = interactions_analysis.drop(['pageviews','published','upper_quartile','twenty_thousand'],axis=1)



In [353]:

    
kf = cross_validation.KFold(len(interactions_analysis),n_folds=5)
scores = []
for train_index, test_index in kf:
    lr = linear_model.LogisticRegression().fit(X.iloc[train_index],y.iloc[train_index])
    scores.append(lr.score(X.iloc[test_index],y.iloc[test_index]))
print "average accuracy for LogisticRegression is", np.mean(scores)
print "average of the set is: ", np.mean(y)









    



average accuracy for LogisticRegression is 0.846046535148
average of the set is:  0.151795236402



In [354]:

    
interactions_lr_scores = lr.predict_proba(X)[:,1]



In [355]:

    
print metrics.roc_auc_score(y,interactions_lr_scores)









    



0.632145261881



In [356]:

    
interactions_probabilities = pd.DataFrame(zip(X.columns,interactions_lr_scores),columns=['tags','probabilities'])



In [357]:

    
interactions_probabilities.sort_values('probabilities',ascending=False)









    Out[357]:






  
    
      
      tags
      probabilities
    
  
  
    
      109
      features_time-week
      0.436462
    
    
      53
      features_animals
      0.400695
    
    
      57
      features_birds
      0.368227
    
    
      149
      animals_animal-week
      0.296609
    
    
      111
      features_underground-week
      0.283282
    
    
      116
      features_watery-wonders
      0.283282
    
    
      150
      animals_birds
      0.270520
    
    
      209
      maps_exploration
      0.261088
    
    
      110
      features_tunnels
      0.261088
    
    
      24
      news_features
      0.183567
    
    
      223
      naturecultures_columns
      0.178019
    
    
      190
      art_fleeting-wonders
      0.175223
    
    
      42
      news_sculptures
      0.169818
    
    
      48
      news_trees
      0.169818
    
    
      26
      news_fossils
      0.169818
    
    
      158
      animals_found
      0.163157
    
    
      163
      animals_video
      0.155569
    
    
      138
      found_archaeology
      0.155189
    
    
      118
      features_women
      0.154375
    
    
      168
      fleeting-wonders_food
      0.151925
    
    
      101
      features_sculptures
      0.148187
    
    
      69
      features_fashion
      0.147209
    
    
      171
      fleeting-wonders_sports
      0.141600
    
    
      145
      found_science
      0.129710
    
    
      146
      found_shipwrecks
      0.129710
    
    
      147
      found_space
      0.129710
    
    
      148
      found_war
      0.129710
    
    
      154
      animals_features
      0.129710
    
    
      151
      animals_cats
      0.129710
    
    
      153
      animals_dogs
      0.129710
    
    
      ...
      ...
      ...
    
    
      50
      news_volcanoes
      0.058617
    
    
      49
      news_underwater
      0.058617
    
    
      13
      news_architecture
      0.058617
    
    
      14
      news_art
      0.058617
    
    
      15
      news_australia
      0.058617
    
    
      16
      news_birds
      0.058617
    
    
      17
      news_books
      0.058617
    
    
      37
      news_oceans
      0.058617
    
    
      39
      news_politics
      0.058617
    
    
      20
      news_crime-and-punishment
      0.058617
    
    
      29
      news_insects
      0.058617
    
    
      36
      news_nasa
      0.058617
    
    
      40
      news_religion
      0.058617
    
    
      34
      news_music
      0.058617
    
    
      41
      news_science
      0.058617
    
    
      32
      news_literature
      0.058617
    
    
      31
      news_japan
      0.058617
    
    
      30
      news_islands
      0.058617
    
    
      43
      news_shipwrecks
      0.058617
    
    
      25
      news_food
      0.058617
    
    
      44
      news_snakes
      0.058617
    
    
      45
      news_space
      0.058617
    
    
      66
      features_crime-and-punishment
      0.053248
    
    
      105
      features_sports
      0.040151
    
    
      9
      news_airplanes
      0.040049
    
    
      46
      news_sports
      0.040049
    
    
      27
      news_garbage-week
      0.040049
    
    
      23
      news_dogs
      0.040049
    
    
      18
      news_churches
      0.040049
    
    
      143
      found_maps
      0.036773
    
  

226 rows × 2 columns



In [475]:

    
interaction_totals.head(2)









    Out[475]:






  
    
      
      num_tagged
    
  
  
    
      pageviews
      21941190.0
    
    
      news_space
      43.0



In [469]:

    
def split_tag(x):
    return x.split('_')[1]
interactions_probabilities = interactions_probabilities.reset_index()
interactions_probabilities['subtag'] = interactions_probabilities.tags.apply(split_tag)



In [477]:

    
interactions_probabilities = interactions_probabilities.sort_values(['tags','probabilities'],ascending=[1, 0])



In [471]:

    
interactions_probabilities = interactions_probabilities.set_index('tags').join(interaction_totals)









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-471-09b7b5167df0> in <module>()
----> 1 interactions_probabilities = interactions_probabilities.set_index('tags').join(interaction_totals)

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in join(self, other, on, how, lsuffix, rsuffix, sort)
   4367         # For SparseDataFrame's benefit
   4368         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 4369                                  rsuffix=rsuffix, sort=sort)
   4370 
   4371     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   4381             return merge(self, other, left_on=on, how=how,
   4382                          left_index=on is None, right_index=True,
-> 4383                          suffixes=(lsuffix, rsuffix), sort=sort)
   4384         else:
   4385             if on is not None:

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator)
     33                          right_index=right_index, sort=sort, suffixes=suffixes,
     34                          copy=copy, indicator=indicator)
---> 35     return op.get_result()
     36 if __debug__:
     37     merge.__doc__ = _merge_doc % '\nleft : DataFrame'

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in get_result(self)
    210 
    211         llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,
--> 212                                                      rdata.items, rsuf)
    213 
    214         lindexers = {1: left_indexer} if left_indexer is not None else {}

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in items_overlap_with_suffix(left, lsuffix, right, rsuffix)
   4372         if not lsuffix and not rsuffix:
   4373             raise ValueError('columns overlap but no suffix specified: %s' %
-> 4374                              to_rename)
   4375 
   4376         def lrenamer(x):

ValueError: columns overlap but no suffix specified: Index([u'num_tagged'], dtype='object')



In [478]:

    
interactions_probabilities









    Out[478]:






  
    
      
      tags
      probabilities
      subtag
      num_tagged
    
  
  
    
      184
      100-wonders_disaster-areas
      0.099160
      disaster-areas
      6.0
    
    
      23
      100-wonders_disasters
      0.129710
      disasters
      6.0
    
    
      24
      100-wonders_science
      0.129710
      science
      5.0
    
    
      25
      100-wonders_video
      0.129710
      video
      37.0
    
    
      3
      animals_animal-week
      0.296609
      animal-week
      10.0
    
    
      6
      animals_birds
      0.270520
      birds
      8.0
    
    
      26
      animals_cats
      0.129710
      cats
      8.0
    
    
      176
      animals_columns
      0.121526
      columns
      8.0
    
    
      27
      animals_dogs
      0.129710
      dogs
      7.0
    
    
      28
      animals_features
      0.129710
      features
      38.0
    
    
      29
      animals_fleeting-wonders
      0.129710
      fleeting-wonders
      12.0
    
    
      30
      animals_food
      0.129710
      food
      4.0
    
    
      31
      animals_fossils
      0.129710
      fossils
      4.0
    
    
      15
      animals_found
      0.163157
      found
      21.0
    
    
      32
      animals_list
      0.129710
      list
      8.0
    
    
      187
      animals_naturecultures
      0.085154
      naturecultures
      6.0
    
    
      188
      animals_news
      0.082413
      news
      38.0
    
    
      33
      animals_oceans
      0.129710
      oceans
      7.0
    
    
      16
      animals_video
      0.155569
      video
      8.0
    
    
      178
      animals_video-wonders
      0.116927
      video-wonders
      6.0
    
    
      34
      art_columns
      0.129710
      columns
      6.0
    
    
      191
      art_features
      0.077260
      features
      12.0
    
    
      11
      art_fleeting-wonders
      0.175223
      fleeting-wonders
      4.0
    
    
      35
      art_libraries
      0.129710
      libraries
      5.0
    
    
      36
      art_museums
      0.129710
      museums
      4.0
    
    
      37
      art_museums-and-collections
      0.129710
      museums-and-collections
      4.0
    
    
      38
      art_news
      0.129710
      news
      11.0
    
    
      39
      art_sculptures
      0.129710
      sculptures
      6.0
    
    
      40
      art_visual
      0.129710
      visual
      16.0
    
    
      41
      columns_animals
      0.129710
      animals
      8.0
    
    
      ...
      ...
      ...
      ...
      ...
    
    
      212
      news_snakes
      0.058617
      snakes
      5.0
    
    
      213
      news_space
      0.058617
      space
      43.0
    
    
      224
      news_sports
      0.040049
      sports
      7.0
    
    
      151
      news_statues
      0.129710
      statues
      5.0
    
    
      14
      news_trees
      0.169818
      trees
      5.0
    
    
      214
      news_underwater
      0.058617
      underwater
      7.0
    
    
      215
      news_volcanoes
      0.058617
      volcanoes
      5.0
    
    
      152
      news_war
      0.129710
      war
      4.0
    
    
      153
      news_water
      0.129710
      water
      4.0
    
    
      154
      objects-of-intrigue_features
      0.129710
      features
      7.0
    
    
      155
      objects-of-intrigue_space
      0.129710
      space
      5.0
    
    
      156
      other-capitals-of-the-world_other-capitals-of-...
      0.129710
      other-capitals-of-the-world
      12.0
    
    
      216
      video-wonders_animals
      0.058617
      animals
      6.0
    
    
      217
      video-wonders_australia
      0.058617
      australia
      4.0
    
    
      163
      video-wonders_sports
      0.129710
      sports
      5.0
    
    
      157
      video_100-wonders
      0.129710
      100-wonders
      37.0
    
    
      158
      video_animals
      0.129710
      animals
      8.0
    
    
      159
      video_disaster-areas
      0.129710
      disaster-areas
      5.0
    
    
      160
      video_disasters
      0.129710
      disasters
      5.0
    
    
      161
      video_science
      0.129710
      science
      5.0
    
    
      162
      video_sports
      0.129710
      sports
      5.0
    
    
      164
      visual_abandoned
      0.129710
      abandoned
      6.0
    
    
      165
      visual_architecture
      0.129710
      architecture
      9.0
    
    
      166
      visual_art
      0.129710
      art
      16.0
    
    
      179
      visual_features
      0.110809
      features
      12.0
    
    
      190
      visual_list
      0.077772
      list
      22.0
    
    
      167
      visual_photo-of-the-week
      0.129710
      photo-of-the-week
      10.0
    
    
      168
      visual_photography
      0.129710
      photography
      20.0
    
    
      169
      visual_soviet
      0.129710
      soviet
      6.0
    
    
      170
      visual_space
      0.129710
      space
      9.0
    
  

226 rows × 4 columns



In [ ]:



In [ ]:



In [567]:

    
interactions_probabilities['pageviews'] = [sum(interactions['pageviews'][interactions[item]==1]) for item in interactions_probabilities.tags]



In [570]:

    
interactions_probabilities['mean-PVs'] = interactions_probabilities['pageviews'] // interactions_probabilities['num_tagged']



In [579]:

    
regular_features









    Out[579]:





['places you can no longer go',
 '100 wonders',
 'extra mile',
 'video wonders',
 'news',
 'features',
 'columns',
 'found',
 'animals',
 'fleeting wonders',
 'visual',
 'other capitals of the world',
 'video',
 'art',
 'list',
 'objects of intrigue',
 'maps',
 'morbid monday',
 'female explorers',
 'naturecultures']



In [623]:

    
interactions_probabilities[interactions_probabilities.tags.str.contains('features')==True].sort_values('mean-PVs',
                                                                                                   ascending = False)









    Out[623]:






  
    
      
      tags
      probabilities
      subtag
      num_tagged
      pageviews
      mean-PVs
    
  
  
    
      75
      features_linguistics
      0.129710
      linguistics
      4.0
      207623.0
      51905.0
    
    
      82
      features_miracles-week
      0.129710
      miracles-week
      8.0
      390486.0
      48810.0
    
    
      63
      features_computers
      0.129710
      computers
      7.0
      263227.0
      37603.0
    
    
      91
      features_plants
      0.129710
      plants
      5.0
      145866.0
      29173.0
    
    
      66
      features_film
      0.129710
      film
      14.0
      384072.0
      27433.0
    
    
      100
      features_television
      0.129710
      television
      8.0
      200192.0
      25024.0
    
    
      191
      art_features
      0.077260
      features
      12.0
      284906.0
      23742.0
    
    
      192
      features_art
      0.061562
      art
      12.0
      284906.0
      23742.0
    
    
      74
      features_language
      0.129710
      language
      4.0
      87118.0
      21779.0
    
    
      0
      features_time-week
      0.436462
      time-week
      10.0
      171105.0
      17110.0
    
    
      101
      features_video-games
      0.129710
      video-games
      9.0
      151971.0
      16885.0
    
    
      95
      features_science-fiction
      0.129710
      science-fiction
      4.0
      66197.0
      16549.0
    
    
      106
      features_wwii
      0.129710
      wwii
      6.0
      84399.0
      14066.0
    
    
      58
      features_books
      0.129710
      books
      8.0
      107780.0
      13472.0
    
    
      56
      features_architecture
      0.129710
      architecture
      6.0
      69250.0
      11541.0
    
    
      5
      features_watery-wonders
      0.283282
      watery-wonders
      4.0
      43674.0
      10918.0
    
    
      87
      features_naturecultures
      0.129710
      naturecultures
      26.0
      279450.0
      10748.0
    
    
      173
      naturecultures_features
      0.128812
      features
      26.0
      279450.0
      10748.0
    
    
      85
      features_murder
      0.129710
      murder
      4.0
      42335.0
      10583.0
    
    
      68
      features_games
      0.129710
      games
      8.0
      84261.0
      10532.0
    
    
      59
      features_cats
      0.129710
      cats
      6.0
      59547.0
      9924.0
    
    
      62
      features_columns
      0.129710
      columns
      35.0
      339985.0
      9713.0
    
    
      46
      columns_features
      0.129710
      features
      35.0
      339985.0
      9713.0
    
    
      81
      features_military
      0.129710
      military
      6.0
      55619.0
      9269.0
    
    
      185
      features_religion
      0.098450
      religion
      5.0
      44193.0
      8838.0
    
    
      177
      features_space
      0.119270
      space
      8.0
      67853.0
      8481.0
    
    
      180
      features_crime
      0.107340
      crime
      6.0
      49116.0
      8186.0
    
    
      4
      features_underground-week
      0.283282
      underground-week
      8.0
      65102.0
      8137.0
    
    
      77
      features_literature
      0.129710
      literature
      9.0
      68614.0
      7623.0
    
    
      102
      features_visual
      0.129710
      visual
      12.0
      89421.0
      7451.0
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      84
      features_monsters
      0.129710
      monsters
      6.0
      35139.0
      5856.0
    
    
      105
      features_witchcraft
      0.129710
      witchcraft
      4.0
      23395.0
      5848.0
    
    
      73
      features_kickass-women
      0.129710
      kickass-women
      5.0
      29003.0
      5800.0
    
    
      21
      features_fashion
      0.147209
      fashion
      5.0
      28595.0
      5719.0
    
    
      94
      features_science
      0.129710
      science
      15.0
      82426.0
      5495.0
    
    
      64
      features_dinosaurs
      0.129710
      dinosaurs
      4.0
      21218.0
      5304.0
    
    
      2
      features_birds
      0.368227
      birds
      9.0
      47373.0
      5263.0
    
    
      88
      features_new-york-city
      0.129710
      new-york-city
      6.0
      31498.0
      5249.0
    
    
      86
      features_music
      0.129710
      music
      12.0
      61632.0
      5136.0
    
    
      80
      features_medicine
      0.129710
      medicine
      5.0
      21993.0
      4398.0
    
    
      104
      features_water
      0.129710
      water
      7.0
      30590.0
      4370.0
    
    
      69
      features_garbage
      0.129710
      garbage
      4.0
      17274.0
      4318.0
    
    
      9
      news_features
      0.183567
      features
      6.0
      25507.0
      4251.0
    
    
      89
      features_news
      0.129710
      news
      6.0
      25507.0
      4251.0
    
    
      20
      features_sculptures
      0.148187
      sculptures
      5.0
      20346.0
      4069.0
    
    
      99
      features_technology
      0.129710
      technology
      4.0
      15892.0
      3973.0
    
    
      57
      features_birdweek
      0.129710
      birdweek
      9.0
      35752.0
      3972.0
    
    
      98
      features_statues
      0.129710
      statues
      4.0
      15627.0
      3906.0
    
    
      60
      features_cheat-week
      0.129710
      cheat-week
      8.0
      31149.0
      3893.0
    
    
      103
      features_war
      0.129710
      war
      5.0
      18591.0
      3718.0
    
    
      92
      features_politics
      0.129710
      politics
      14.0
      50899.0
      3635.0
    
    
      219
      features_sports
      0.040151
      sports
      13.0
      40004.0
      3077.0
    
    
      93
      features_presidents
      0.129710
      presidents
      4.0
      12032.0
      3008.0
    
    
      7
      features_tunnels
      0.261088
      tunnels
      4.0
      11691.0
      2922.0
    
    
      61
      features_china
      0.129710
      china
      9.0
      17069.0
      1896.0
    
    
      71
      features_halloween
      0.129710
      halloween
      4.0
      6578.0
      1644.0
    
    
      97
      features_sounds
      0.129710
      sounds
      4.0
      5630.0
      1407.0
    
    
      96
      features_snow
      0.129710
      snow
      4.0
      4647.0
      1161.0
    
    
      90
      features_objects-of-intrigue
      0.129710
      objects-of-intrigue
      7.0
      NaN
      NaN
    
    
      154
      objects-of-intrigue_features
      0.129710
      features
      7.0
      NaN
      NaN
    
  

76 rows × 6 columns



In [ ]:

    
interactions_probabilities.sort_values('probabilities',ascending = False)



In [625]:

    
np.mean(interactions.pageviews)









    Out[625]:





7938.2018813314035



In [620]:

    
#I took the dashes out. Have to add back for this
fix_regular_features = [x.replace(' ','-') for x in regular_features]
fig,axes=plt.subplots(figsize=(10,10))
for item, name in enumerate(fix_regular_features):
    interactions.plot(x=interactions['pageviews'][interactions.columns.str.contains(name)==True],kind='box',ax=item)
plt.show()









    



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-620-866e8988a9c8> in <module>()
      3 fig,axes=plt.subplots(figsize=(10,10))
      4 for item, name in enumerate(fix_regular_features):
----> 5     interactions.plot(x=interactions['pageviews'][interactions.columns.str.contains(name)==True],kind='boxplot',ax=item)
      6 plt.show()

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in __call__(self, x, y, kind, ax, subplots, sharex, sharey, layout, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, secondary_y, sort_columns, **kwds)
   3735                           fontsize=fontsize, colormap=colormap, table=table,
   3736                           yerr=yerr, xerr=xerr, secondary_y=secondary_y,
-> 3737                           sort_columns=sort_columns, **kwds)
   3738     __call__.__doc__ = plot_frame.__doc__
   3739 

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in plot_frame(data, x, y, kind, ax, subplots, sharex, sharey, layout, figsize, use_index, title, grid, legend, style, logx, logy, loglog, xticks, yticks, xlim, ylim, rot, fontsize, colormap, table, yerr, xerr, secondary_y, sort_columns, **kwds)
   2609                  yerr=yerr, xerr=xerr,
   2610                  secondary_y=secondary_y, sort_columns=sort_columns,
-> 2611                  **kwds)
   2612 
   2613 

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/tools/plotting.pyc in _plot(data, x, y, subplots, ax, kind, **kwds)
   2388         klass = _plot_klass[kind]
   2389     else:
-> 2390         raise ValueError("%r is not a valid plot kind" % kind)
   2391 
   2392     from pandas import DataFrame

ValueError: 'boxplot' is not a valid plot kind



In [453]:

    
#doublecheck my work on pageviews vs num-published
pub_volume = tag_analysis[['published','pageviews']]
pub_volume['num_pubbed'] = 1
pub_volume['published'] = pd.to_datetime(pub_volume.published)
pub_volume = pub_volume.set_index('published')









    



/Users/Mike/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
/Users/Mike/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [454]:

    
pub_volume.head(10)



In [455]:

    
pub_volume = pub_volume.resample('M').sum().dropna()



In [456]:

    
pub_volume['year'] = pub_volume.index.year



In [457]:

    
pub_volume[pub_volume.index.year >=2015].corr()









    Out[457]:






  
    
      
      pageviews
      num_pubbed
      year
    
  
  
    
      pageviews
      1.000000
      0.926886
      0.506975
    
    
      num_pubbed
      0.926886
      1.000000
      0.650054
    
    
      year
      0.506975
      0.650054
      1.000000



In [458]:

    
pub_volume[pub_volume.index.year >=2015].plot(kind='scatter',x='num_pubbed',y='pageviews')









    Out[458]:





<matplotlib.axes._subplots.AxesSubplot at 0x138e54750>



In [459]:

    
import seaborn as sns
ax = sns.regplot(x='num_pubbed',y='pageviews',data=pub_volume)

Now I'm going to try this with the time series 30days pvs



In [446]:

    
#doublecheck my work on pageviews vs num-published
pub_volume = time_series[['published','7-day-PVs']]
pub_volume['num_pubbed'] = 1
pub_volume['published'] = pd.to_datetime(pub_volume.published)
pub_volume = pub_volume.set_index('published')









    



/Users/Mike/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
/Users/Mike/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [447]:

    
pub_volume.head(10)



In [448]:

    
num_holder = pub_volume.resample('D').sum().dropna().drop('7-day-PVs',axis=1)
pub_volume = pub_volume.resample('D').sum().dropna().drop('num_pubbed',axis=1)
pub_volume = pub_volume.join(num_holder)
pub_volume['year'] = pub_volume.index.year
pub_volume[pub_volume.index.year >=2015].corr()









    Out[448]:






  
    
      
      7-day-PVs
      num_pubbed
      year
    
  
  
    
      7-day-PVs
      1.000000
      0.466825
      0.158331
    
    
      num_pubbed
      0.466825
      1.000000
      0.380993
    
    
      year
      0.158331
      0.380993
      1.000000



In [451]:

    
pub_volume[pub_volume.index >='2016-01-01'].plot(kind='scatter',x='num_pubbed',y='7-day-PVs',title='7-Day PVs')









    Out[451]:





<matplotlib.axes._subplots.AxesSubplot at 0x137c1f5d0>



In [452]:

    
import seaborn as sns
ax = sns.regplot(x='num_pubbed',y='7-day-PVs',data=pub_volume)

Let's check average performance when just looking at Simplereach Tag data



In [540]:

    
simplereach = pd.read_csv('simplereach-tags.csv')



In [541]:

    
simplereach.head(1)









    Out[541]:






  
    
      
      Tag
      Page Views
      Social Actions
      Social Referrals
      Facebook Actions
      Facebook CommentsBox
      Facebook Likes
      Facebook Shares
      Facebook Comments
      Twitter Actions
      ...
      Desktop Reddit Referrals
      Mobile Delicious Referrals
      Tablet Delicious Referrals
      Desktop Delicious Referrals
      Mobile Pinterest Referrals
      Tablet Pinterest Referrals
      Desktop Pinterest Referrals
      Mobile Google Plus Referrals
      Tablet Google Plus Referrals
      Desktop Google Plus Referrals
    
  
  
    
      0
      features
      5009360
      716287
      2337637
      659298
      0
      431705
      126757
      100830
      46756
      ...
      163987
      21
      6
      248
      808
      413
      696
      1523
      428
      2179
    
  

1 rows × 59 columns



In [542]:

    
simplereach = simplereach.set_index('Tag')



In [546]:

    
total_tagged2 = total_tagged



In [547]:

    
total_tagged2.head(4)









    Out[547]:






  
    
      
      num_tagged
    
  
  
    
      year
      5568818.0
    
    
      news
      461.0
    
    
      upper_quartile
      427.0
    
    
      features
      356.0



In [548]:

    
total_tagged2.index = [x.replace('-',' ') for x in total_tagged.index]

simplereach = simplereach.join(total_tagged2)



In [549]:

    
simplereach['mean-PVs'] = simplereach['Page Views'] // simplereach['num_tagged']
simplereach['mean-shares'] = simplereach['Facebook Shares'] // simplereach['num_tagged']



In [550]:

    
simplereach = simplereach[['mean-PVs','mean-shares','num_tagged']]



In [622]:

    
simplereach[simplereach['num_tagged'] > 5].sort_values('mean-PVs',ascending=False)









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-622-46d81dbbd256> in <module>()
----> 1 simplereach['space'][(simplereach['num_tagged'] > 5)].sort_values('mean-PVs',ascending=False)

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1990             return self._getitem_multilevel(key)
   1991         else:
-> 1992             return self._getitem_column(key)
   1993 
   1994     def _getitem_column(self, key):

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1997         # get column
   1998         if self.columns.is_unique:
-> 1999             return self._get_item_cache(key)
   2000 
   2001         # duplicate columns & possible reduce dimensionality

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1343         res = cache.get(item)
   1344         if res is None:
-> 1345             values = self._data.get(item)
   1346             res = self._box_item_values(item, values)
   1347             cache[item] = res

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   3223 
   3224             if not isnull(item):
-> 3225                 loc = self.items.get_loc(item)
   3226             else:
   3227                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/Mike/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   1876                 return self._engine.get_loc(key)
   1877             except KeyError:
-> 1878                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1879 
   1880         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)()

KeyError: 'space'



In [554]:

    
#regular_features = [x.replace('-',' ') for x in regular_features]
simplereach.ix[regular_features].sort_values('mean-PVs',ascending=False)









    Out[554]:






  
    
      
      mean-PVs
      mean-shares
      num_tagged
    
    
      Tag
      
      
      
    
  
  
    
      maps
      18568.0
      545.0
      60.0
    
    
      naturecultures
      14334.0
      314.0
      27.0
    
    
      features
      14071.0
      356.0
      356.0
    
    
      other capitals of the world
      13825.0
      391.0
      12.0
    
    
      visual
      12602.0
      435.0
      117.0
    
    
      video wonders
      12324.0
      329.0
      47.0
    
    
      video
      10834.0
      231.0
      102.0
    
    
      list
      8637.0
      195.0
      85.0
    
    
      news
      8146.0
      225.0
      461.0
    
    
      extra mile
      7989.0
      248.0
      16.0
    
    
      found
      7756.0
      241.0
      167.0
    
    
      columns
      7465.0
      222.0
      212.0
    
    
      female explorers
      6573.0
      343.0
      15.0
    
    
      animals
      6086.0
      259.0
      165.0
    
    
      100 wonders
      5687.0
      90.0
      44.0
    
    
      objects of intrigue
      5029.0
      212.0
      79.0
    
    
      fleeting wonders
      4452.0
      115.0
      156.0
    
    
      art
      4317.0
      139.0
      92.0
    
    
      places you can no longer go
      3948.0
      38.0
      44.0
    
    
      morbid monday
      1537.0
      32.0
      45.0



In [ ]:



In [ ]:

Let's run some regression analysis on our tag_analysis DataFrame



In [135]:

    
from sklearn import linear_model



In [136]:

    
from sklearn import metrics



In [137]:

    
tag_analysis.fillna(value=0,inplace=True)



In [138]:

    
y = tag_analysis.upper_quartile
X = tag_analysis.drop(['pageviews','published','upper_quartile'],axis=1)



In [139]:

    
from sklearn import cross_validation



In [140]:

    
kf = cross_validation.KFold(len(tag_analysis),n_folds=5)
scores = []
for train_index, test_index in kf:
    lr = linear_model.LogisticRegression().fit(X.iloc[train_index],y.iloc[train_index])
    scores.append(lr.score(X.iloc[test_index],y.iloc[test_index]))
print "average accuracy for LogisticRegression is", np.mean(scores)
print "average of the set is: ", np.mean(y)









    



average accuracy for LogisticRegression is 0.847468758494
average of the set is:  0.151795236402



In [141]:

    
lr_scores = lr.predict_proba(X)[:,1]



In [142]:

    
print metrics.roc_auc_score(y,lr_scores)









    



0.672637614814



In [143]:

    
print metrics.roc_auc_score(y,lr_scores)









    



0.672637614814



In [144]:

    
lr_scores









    Out[144]:





array([ 0.176529  ,  0.14691911,  0.14657648, ...,  0.14657648,
        0.23156819,  0.14657648])



In [145]:

    
coefficients = pd.DataFrame(zip(X.columns,lr.coef_[0]),columns=['tags','coefficients'])
probabilities = pd.DataFrame(zip(X.columns,lr_scores),columns=['tags','probabilities'])



In [146]:

    
probabilities.sort_values('probabilities',ascending=False)









    Out[146]:






  
    
      
      tags
      probabilities
    
  
  
    
      33
      news
      0.189916
    
    
      0
      100-wonders
      0.176529
    
    
      1
      31-days-of-halloween
      0.146919
    
    
      8
      cemeteries
      0.146576
    
    
      51
      war
      0.146576
    
    
      21
      garbage-week
      0.146576
    
    
      10
      churches
      0.146576
    
    
      52
      wwii
      0.146576
    
    
      3
      animals
      0.146576
    
    
      2
      abandoned
      0.146576
    
    
      24
      libraries
      0.091183
    
    
      11
      columns
      0.091183
    
    
      47
      underground-week
      0.091183
    
    
      22
      holidays
      0.091183
    
    
      38
      places-you-can-no-longer-go
      0.090890
    
    
      12
      crime-and-punishment
      0.090599
    
    
      35
      objects-of-intrigue
      0.089517
    
    
      39
      politics
      0.084188
    
    
      6
      birds
      0.061686
    
    
      9
      cheat-week
      0.059556
    
    
      5
      art
      0.059297
    
    
      36
      obscura-day
      0.056281
    
    
      37
      oceans
      0.056281
    
    
      7
      books
      0.056281
    
    
      40
      rites-and-rituals
      0.056281
    
    
      32
      music
      0.056281
    
    
      43
      society-adventures
      0.056281
    
    
      44
      space
      0.056281
    
    
      48
      video
      0.056281
    
    
      49
      video-wonders
      0.056281
    
    
      50
      visual
      0.056281
    
    
      41
      ruins
      0.056281
    
    
      30
      morbid-monday
      0.056281
    
    
      19
      food
      0.056281
    
    
      25
      list
      0.056281
    
    
      13
      curious-fact-of-the-week
      0.056281
    
    
      14
      death
      0.056281
    
    
      16
      features
      0.056281
    
    
      20
      found
      0.056281
    
    
      26
      magic
      0.056281
    
    
      4
      architecture
      0.056281
    
    
      28
      maps
      0.056281
    
    
      29
      medicine
      0.056281
    
    
      31
      museums
      0.051961
    
    
      15
      exploration
      0.040709
    
    
      27
      map-monday
      0.037582
    
    
      42
      science
      0.033551
    
    
      23
      islands
      0.033012
    
    
      18
      fleeting-wonders
      0.033012
    
    
      34
      notes-from-the-field
      0.029644
    
    
      45
      sports
      0.029644
    
    
      17
      film
      0.022731
    
    
      46
      transportation
      0.020998



In [147]:

    
coefficients.sort_values('coefficients',ascending=False)









    Out[147]:






  
    
      
      tags
      coefficients
    
  
  
    
      52
      wwii
      1.065043
    
    
      28
      maps
      0.848714
    
    
      7
      books
      0.836221
    
    
      47
      underground-week
      0.575863
    
    
      29
      medicine
      0.567487
    
    
      2
      abandoned
      0.562231
    
    
      9
      cheat-week
      0.517251
    
    
      24
      libraries
      0.385578
    
    
      30
      morbid-monday
      0.327623
    
    
      4
      architecture
      0.303256
    
    
      49
      video-wonders
      0.293608
    
    
      25
      list
      0.234942
    
    
      16
      features
      0.221080
    
    
      11
      columns
      0.204949
    
    
      31
      museums
      0.159610
    
    
      37
      oceans
      0.134381
    
    
      10
      churches
      0.129452
    
    
      14
      death
      0.102822
    
    
      41
      ruins
      0.055650
    
    
      32
      music
      0.055465
    
    
      21
      garbage-week
      0.013623
    
    
      3
      animals
      0.002736
    
    
      5
      art
      0.000338
    
    
      19
      food
      -0.007075
    
    
      50
      visual
      -0.013276
    
    
      17
      film
      -0.021714
    
    
      1
      31-days-of-halloween
      -0.023128
    
    
      20
      found
      -0.043365
    
    
      51
      war
      -0.084437
    
    
      46
      transportation
      -0.102804
    
    
      27
      map-monday
      -0.204296
    
    
      22
      holidays
      -0.280851
    
    
      26
      magic
      -0.340270
    
    
      33
      news
      -0.403831
    
    
      35
      objects-of-intrigue
      -0.430410
    
    
      23
      islands
      -0.464792
    
    
      12
      crime-and-punishment
      -0.488901
    
    
      48
      video
      -0.520194
    
    
      0
      100-wonders
      -0.537564
    
    
      44
      space
      -0.541108
    
    
      42
      science
      -0.557842
    
    
      18
      fleeting-wonders
      -0.601600
    
    
      8
      cemeteries
      -0.668934
    
    
      39
      politics
      -0.709310
    
    
      36
      obscura-day
      -0.759159
    
    
      40
      rites-and-rituals
      -0.922080
    
    
      6
      birds
      -0.941553
    
    
      15
      exploration
      -0.959669
    
    
      34
      notes-from-the-field
      -0.964113
    
    
      43
      society-adventures
      -1.238194
    
    
      45
      sports
      -1.278518
    
    
      13
      curious-fact-of-the-week
      -1.470357
    
    
      38
      places-you-can-no-longer-go
      -1.730003



In [148]:

    
tag_analysis[tag_analysis['100-wonders'] ==1].describe()









    Out[148]:






  
    
      
      100-wonders
      31-days-of-halloween
      abandoned
      animals
      architecture
      art
      birds
      books
      cemeteries
      cheat-week
      ...
      sports
      transportation
      underground-week
      video
      video-wonders
      visual
      war
      wwii
      pageviews
      upper_quartile
    
  
  
    
      count
      44.0
      44.0
      44.000000
      44.0
      44.000000
      44.0
      44.000000
      44.0
      44.000000
      44.0
      ...
      44.0
      44.0
      44.0
      44.000000
      44.0
      44.0
      44.000000
      44.0
      44.000000
      44.000000
    
    
      mean
      1.0
      0.0
      0.045455
      0.0
      0.022727
      0.0
      0.022727
      0.0
      0.045455
      0.0
      ...
      0.0
      0.0
      0.0
      0.840909
      0.0
      0.0
      0.022727
      0.0
      3817.977273
      0.068182
    
    
      std
      0.0
      0.0
      0.210707
      0.0
      0.150756
      0.0
      0.150756
      0.0
      0.210707
      0.0
      ...
      0.0
      0.0
      0.0
      0.369989
      0.0
      0.0
      0.150756
      0.0
      3259.347741
      0.254972
    
    
      min
      1.0
      0.0
      0.000000
      0.0
      0.000000
      0.0
      0.000000
      0.0
      0.000000
      0.0
      ...
      0.0
      0.0
      0.0
      0.000000
      0.0
      0.0
      0.000000
      0.0
      792.000000
      0.000000
    
    
      25%
      1.0
      0.0
      0.000000
      0.0
      0.000000
      0.0
      0.000000
      0.0
      0.000000
      0.0
      ...
      0.0
      0.0
      0.0
      1.000000
      0.0
      0.0
      0.000000
      0.0
      1737.750000
      0.000000
    
    
      50%
      1.0
      0.0
      0.000000
      0.0
      0.000000
      0.0
      0.000000
      0.0
      0.000000
      0.0
      ...
      0.0
      0.0
      0.0
      1.000000
      0.0
      0.0
      0.000000
      0.0
      2771.000000
      0.000000
    
    
      75%
      1.0
      0.0
      0.000000
      0.0
      0.000000
      0.0
      0.000000
      0.0
      0.000000
      0.0
      ...
      0.0
      0.0
      0.0
      1.000000
      0.0
      0.0
      0.000000
      0.0
      4522.750000
      0.000000
    
    
      max
      1.0
      0.0
      1.000000
      0.0
      1.000000
      0.0
      1.000000
      0.0
      1.000000
      0.0
      ...
      0.0
      0.0
      0.0
      1.000000
      0.0
      0.0
      1.000000
      0.0
      16769.000000
      1.000000
    
  

8 rows × 55 columns



In [149]:

    
tag_analysis.head()









    Out[149]:






  
    
      
      100-wonders
      31-days-of-halloween
      abandoned
      animals
      architecture
      art
      birds
      books
      cemeteries
      cheat-week
      ...
      transportation
      underground-week
      video
      video-wonders
      visual
      war
      wwii
      published
      pageviews
      upper_quartile
    
    
      tagged_url
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      2015-08-01
      651.0
      0
    
    
      www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos
      0.0
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-06-09
      3505.0
      0
    
    
      www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2014-05-12
      840.0
      0
    
    
      www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      2015-12-30
      4037.0
      0
    
    
      www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy
      1.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      ...
      0.0
      0.0
      1.0
      0.0
      0.0
      0.0
      0.0
      2016-01-07
      1620.0
      0
    
  

5 rows × 56 columns

Now let's try it with KNN



In [150]:

    
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier



In [ ]:

    
params = {'n_neighbors': [x for x in range(2,200,1)],
          'weights': ['distance','uniform']}
gs = GridSearchCV(estimator=KNeighborsClassifier(),param_grid=params,n_jobs=8,cv=10)
gs.fit(X,y)
print gs.best_params_
print gs.best_score_



In [160]:

    
print type(gs.best_estimator_)









    



<class 'sklearn.neighbors.classification.KNeighborsClassifier'>



In [161]:

    
knn = gs.best_estimator_.fit(X,y)



In [162]:

    
knn_scores = knn.predict_proba(X)[:,1]



In [163]:

    
print np.mean(knn_scores)









    



0.160428622213



In [164]:

    
print np.mean(lr_scores)









    



0.134654700331



In [165]:

    
knn_probabilities = pd.DataFrame(zip(X.columns,knn_scores),columns=['tags','probabilities'])



In [166]:

    
knn_probabilities.sort_values('probabilities',ascending=False)









    Out[166]:






  
    
      
      tags
      probabilities
    
  
  
    
      52
      wwii
      0.250000
    
    
      8
      cemeteries
      0.250000
    
    
      2
      abandoned
      0.250000
    
    
      3
      animals
      0.250000
    
    
      51
      war
      0.250000
    
    
      21
      garbage-week
      0.250000
    
    
      10
      churches
      0.250000
    
    
      33
      news
      0.214286
    
    
      0
      100-wonders
      0.214286
    
    
      12
      crime-and-punishment
      0.178571
    
    
      35
      objects-of-intrigue
      0.142857
    
    
      42
      science
      0.107143
    
    
      9
      cheat-week
      0.071429
    
    
      5
      art
      0.071429
    
    
      34
      notes-from-the-field
      0.071429
    
    
      36
      obscura-day
      0.071429
    
    
      37
      oceans
      0.071429
    
    
      39
      politics
      0.071429
    
    
      40
      rites-and-rituals
      0.071429
    
    
      41
      ruins
      0.071429
    
    
      43
      society-adventures
      0.071429
    
    
      7
      books
      0.071429
    
    
      44
      space
      0.071429
    
    
      45
      sports
      0.071429
    
    
      46
      transportation
      0.071429
    
    
      47
      underground-week
      0.071429
    
    
      48
      video
      0.071429
    
    
      49
      video-wonders
      0.071429
    
    
      50
      visual
      0.071429
    
    
      4
      architecture
      0.071429
    
    
      32
      music
      0.071429
    
    
      31
      museums
      0.071429
    
    
      30
      morbid-monday
      0.071429
    
    
      20
      found
      0.071429
    
    
      11
      columns
      0.071429
    
    
      13
      curious-fact-of-the-week
      0.071429
    
    
      14
      death
      0.071429
    
    
      15
      exploration
      0.071429
    
    
      16
      features
      0.071429
    
    
      17
      film
      0.071429
    
    
      18
      fleeting-wonders
      0.071429
    
    
      19
      food
      0.071429
    
    
      6
      birds
      0.071429
    
    
      29
      medicine
      0.071429
    
    
      22
      holidays
      0.071429
    
    
      23
      islands
      0.071429
    
    
      24
      libraries
      0.071429
    
    
      25
      list
      0.071429
    
    
      27
      map-monday
      0.071429
    
    
      28
      maps
      0.071429
    
    
      26
      magic
      0.071429
    
    
      38
      places-you-can-no-longer-go
      0.035714
    
    
      1
      31-days-of-halloween
      0.000000

Let's check the roc_auc scores for both the knn and logistic regression models.



In [167]:

    
print 'knn', metrics.roc_auc_score(y,knn_scores)
print 'lr', metrics.roc_auc_score(y,lr_scores)









    



knn 0.672067348369
lr 0.672637614814

Looks like they give similar scores, but the scores are easily manipulated by changing the threshold for the number of articles per tag and by changing the threshold for "success" (currently set at > 10,000 Pageviews).



In [62]:

    
probabilities = probabilities.set_index('tags')



In [63]:

    
probabilities = probabilities.join(total_tagged)



In [64]:

    
probabilities.to_csv('tag-probabilities-logisticregression.csv')

Now let's try RandomForest



In [65]:

    
from sklearn.ensemble import RandomForestClassifier



In [66]:

    
params = {'max_depth': np.arange(20,100,2),
          'min_samples_leaf': np.arange(90,200,2),
          'n_estimators': 20}
gs1 = GridSearchCV(RandomForestClassifier(),param_grid=params, cv=10, scoring='roc_auc',n_jobs=8,verbose=1)
gs1.fit(X,y)
print gs1.best_params_
print gs1.best_score_









    



Fitting 10 folds for each of 4400 candidates, totalling 44000 fits






    



[Parallel(n_jobs=8)]: Done  52 tasks      | elapsed:    0.6s
[Parallel(n_jobs=8)]: Done 352 tasks      | elapsed:    2.7s
[Parallel(n_jobs=8)]: Done 852 tasks      | elapsed:    6.6s
[Parallel(n_jobs=8)]: Done 1552 tasks      | elapsed:   12.5s
[Parallel(n_jobs=8)]: Done 2452 tasks      | elapsed:   21.4s
[Parallel(n_jobs=8)]: Done 3552 tasks      | elapsed:   31.1s
[Parallel(n_jobs=8)]: Done 4852 tasks      | elapsed:   43.3s
[Parallel(n_jobs=8)]: Done 6352 tasks      | elapsed:   57.3s
[Parallel(n_jobs=8)]: Done 8052 tasks      | elapsed:  1.2min
[Parallel(n_jobs=8)]: Done 9952 tasks      | elapsed:  1.5min
[Parallel(n_jobs=8)]: Done 12052 tasks      | elapsed:  1.9min
[Parallel(n_jobs=8)]: Done 14352 tasks      | elapsed:  2.2min
[Parallel(n_jobs=8)]: Done 16852 tasks      | elapsed:  2.6min
[Parallel(n_jobs=8)]: Done 19552 tasks      | elapsed:  3.0min
[Parallel(n_jobs=8)]: Done 22452 tasks      | elapsed:  3.5min
[Parallel(n_jobs=8)]: Done 25552 tasks      | elapsed:  4.0min
[Parallel(n_jobs=8)]: Done 28852 tasks      | elapsed:  4.5min
[Parallel(n_jobs=8)]: Done 32352 tasks      | elapsed:  5.0min
[Parallel(n_jobs=8)]: Done 36052 tasks      | elapsed:  5.6min
[Parallel(n_jobs=8)]: Done 39952 tasks      | elapsed:  6.3min
[Parallel(n_jobs=8)]: Done 44000 out of 44000 | elapsed:  6.9min finished






    



{'max_depth': 95, 'min_samples_leaf': 116}
0.568912853263



In [67]:

    
rf = RandomForestClassifier(gs1.best_estimator_)
rf.fit(X,y)
probs = rf.predict_proba(X)[:,1]
print rf.score(X,y)
print metrics.roc_auc_score(y,probs)









    



0.753999289015
0.551845977331



In [69]:

    
probs = pd.DataFrame(zip(X.columns,probs),columns=['tags','probabilities'])



In [71]:

    
probs.sort_values('probabilities',ascending=False)









    Out[71]:






  
    
      
      tags
      probabilities
    
  
  
    
      0
      100-wonders
      0.250698
    
    
      27
      map-monday
      0.250698
    
    
      29
      medicine
      0.250698
    
    
      30
      morbid-monday
      0.250698
    
    
      31
      museums
      0.250698
    
    
      32
      music
      0.250698
    
    
      33
      news
      0.250698
    
    
      34
      notes-from-the-field
      0.250698
    
    
      35
      objects-of-intrigue
      0.250698
    
    
      36
      obscura-day
      0.250698
    
    
      37
      oceans
      0.250698
    
    
      38
      places-you-can-no-longer-go
      0.250698
    
    
      39
      politics
      0.250698
    
    
      40
      rites-and-rituals
      0.250698
    
    
      41
      ruins
      0.250698
    
    
      42
      science
      0.250698
    
    
      43
      society-adventures
      0.250698
    
    
      44
      space
      0.250698
    
    
      45
      sports
      0.250698
    
    
      46
      transportation
      0.250698
    
    
      47
      underground-week
      0.250698
    
    
      48
      video
      0.250698
    
    
      49
      video-wonders
      0.250698
    
    
      50
      visual
      0.250698
    
    
      51
      war
      0.250698
    
    
      28
      maps
      0.250698
    
    
      26
      magic
      0.250698
    
    
      1
      31-days-of-halloween
      0.250698
    
    
      25
      list
      0.250698
    
    
      2
      abandoned
      0.250698
    
    
      3
      animals
      0.250698
    
    
      4
      architecture
      0.250698
    
    
      5
      art
      0.250698
    
    
      6
      birds
      0.250698
    
    
      7
      books
      0.250698
    
    
      8
      cemeteries
      0.250698
    
    
      9
      cheat-week
      0.250698
    
    
      10
      churches
      0.250698
    
    
      11
      columns
      0.250698
    
    
      12
      crime-and-punishment
      0.250698
    
    
      13
      curious-fact-of-the-week
      0.250698
    
    
      14
      death
      0.250698
    
    
      15
      exploration
      0.250698
    
    
      16
      features
      0.250698
    
    
      17
      film
      0.250698
    
    
      18
      fleeting-wonders
      0.250698
    
    
      19
      food
      0.250698
    
    
      20
      found
      0.250698
    
    
      21
      garbage-week
      0.250698
    
    
      22
      holidays
      0.250698
    
    
      23
      islands
      0.250698
    
    
      24
      libraries
      0.250698
    
    
      52
      wwii
      0.250698

Let's try the Logistic Model but with more tags



In [144]:

    
tag_analysis2 = article_set.drop(total_tagged[total_tagged.num_tagged < 15].index,axis=1)



In [190]:

    
tag_analysis2['ten_thousand'] = [1 if x > 10000 else 0 for x in tag_analysis2.pageviews]



In [191]:

    
tag_analysis2.fillna(value=0,inplace=True)
y2 = tag_analysis2.ten_thousand
X2 = tag_analysis2.drop(['pageviews','upper_quartile','ten_thousand'],axis=1)



In [192]:

    
kf2 = cross_validation.KFold(len(tag_analysis2),n_folds=5)
scores2 = []
for train_index, test_index in kf2:
    lr2 = linear_model.LogisticRegression().fit(X2.iloc[train_index],y2.iloc[train_index])
    scores2.append(lr2.score(X2.iloc[test_index],y2.iloc[test_index]))
print "average accuracy for LogisticRegression is", np.mean(scores2)
print "average of the set is: ", np.mean(y2)









    



average accuracy for LogisticRegression is 0.846403039133
average of the set is:  0.151795236402



In [193]:

    
print tag_analysis2.shape
print y2.shape
print X2.shape









    



(2813, 136)
(2813,)
(2813, 133)



In [194]:

    
lr_scores2 = lr2.predict_proba(X2)[:,1]



In [195]:

    
lr2_probs = pd.DataFrame(zip(X2.columns,lr_scores2),columns=['tags','probabilities'])



In [196]:

    
lr2_probs.sort_values('probabilities',ascending=False)









    Out[196]:






  
    
      
      tags
      probabilities
    
  
  
    
      130
      women
      0.371218
    
    
      3
      aircraft
      0.330070
    
    
      125
      visual
      0.313056
    
    
      110
      sports
      0.303625
    
    
      71
      mummies
      0.288925
    
    
      99
      saints
      0.272294
    
    
      115
      time-week
      0.263755
    
    
      98
      ruins
      0.241347
    
    
      109
      space
      0.235258
    
    
      100
      science
      0.235128
    
    
      70
      mountains
      0.233971
    
    
      103
      ships
      0.215629
    
    
      119
      tunnels
      0.211667
    
    
      118
      trees
      0.207395
    
    
      111
      statues
      0.204833
    
    
      116
      trains
      0.204833
    
    
      60
      magic
      0.197443
    
    
      107
      society-adventures
      0.191618
    
    
      0
      100-wonders
      0.188841
    
    
      57
      libraries
      0.182539
    
    
      112
      subterranean
      0.182539
    
    
      132
      wwii
      0.166283
    
    
      131
      world-s-fair
      0.152512
    
    
      105
      skeletons
      0.147225
    
    
      53
      insects
      0.145729
    
    
      61
      map-monday
      0.142249
    
    
      77
      nasa
      0.141422
    
    
      82
      no-ones-watching-week
      0.141422
    
    
      80
      new-york-city
      0.141422
    
    
      79
      naturecultures
      0.141422
    
    
      ...
      ...
      ...
    
    
      5
      ancient
      0.043366
    
    
      28
      design
      0.043092
    
    
      9
      architecture
      0.042018
    
    
      52
      infrastructure
      0.041597
    
    
      6
      animals
      0.041193
    
    
      32
      escape-week
      0.039818
    
    
      13
      books
      0.037367
    
    
      29
      dinosaurs
      0.037367
    
    
      41
      fleeting-wonders
      0.037367
    
    
      101
      sculptures
      0.036208
    
    
      35
      extra-mile
      0.035296
    
    
      20
      churches
      0.033240
    
    
      25
      crime-and-punishment
      0.033240
    
    
      16
      cemeteries
      0.033240
    
    
      14
      cats
      0.033240
    
    
      49
      halloween
      0.033240
    
    
      50
      holidays
      0.033240
    
    
      44
      games
      0.031995
    
    
      15
      caves
      0.031769
    
    
      42
      food
      0.031127
    
    
      27
      death
      0.028319
    
    
      30
      disaster-areas
      0.027857
    
    
      37
      features
      0.023676
    
    
      23
      computers
      0.021552
    
    
      34
      exploration
      0.017840
    
    
      45
      garbage
      0.017840
    
    
      18
      china
      0.016213
    
    
      4
      airplanes
      0.013174
    
    
      17
      cheat-week
      0.009833
    
    
      46
      garbage-week
      0.006122
    
  

133 rows × 2 columns



In [197]:

    
metrics.roc_auc_score(y2,lr2.predict_proba(X2)[:,1])









    Out[197]:





0.73045144294096509



In [198]:

    
lr2_probs = lr2_probs.set_index('tags')



In [199]:

    
lr2_probs = lr2_probs.join(total_tagged)



In [206]:

    
plt.figure(figsize=(10,10))
plt.scatter(lr2_probs.num_tagged,lr2_probs.probabilities)
plt.show()



In [201]:

    
lr2_probs = lr2_probs.sort_values('probabilities',ascending=False)



In [202]:

    
lr2_probs = lr2_probs.reset_index()



In [203]:

    
lr2_probs.to_csv('min15tags_min10000pvs.csv')



In [204]:

    
lr2_probs.shape









    Out[204]:





(133, 3)



In [207]:

    
lr2_probs









    Out[207]:






  
    
      
      tags
      probabilities
      num_tagged
    
  
  
    
      0
      women
      0.371218
      16.0
    
    
      1
      aircraft
      0.330070
      16.0
    
    
      2
      visual
      0.313056
      117.0
    
    
      3
      sports
      0.303625
      45.0
    
    
      4
      mummies
      0.288925
      27.0
    
    
      5
      saints
      0.272294
      19.0
    
    
      6
      time-week
      0.263755
      27.0
    
    
      7
      ruins
      0.241347
      36.0
    
    
      8
      space
      0.235258
      105.0
    
    
      9
      science
      0.235128
      58.0
    
    
      10
      mountains
      0.233971
      21.0
    
    
      11
      ships
      0.215629
      20.0
    
    
      12
      tunnels
      0.211667
      22.0
    
    
      13
      trees
      0.207395
      29.0
    
    
      14
      statues
      0.204833
      18.0
    
    
      15
      trains
      0.204833
      17.0
    
    
      16
      magic
      0.197443
      40.0
    
    
      17
      society-adventures
      0.191618
      57.0
    
    
      18
      100-wonders
      0.188841
      44.0
    
    
      19
      libraries
      0.182539
      31.0
    
    
      20
      subterranean
      0.182539
      18.0
    
    
      21
      wwii
      0.166283
      33.0
    
    
      22
      world-s-fair
      0.152512
      20.0
    
    
      23
      skeletons
      0.147225
      20.0
    
    
      24
      insects
      0.145729
      15.0
    
    
      25
      map-monday
      0.142249
      35.0
    
    
      26
      nasa
      0.141422
      18.0
    
    
      27
      no-ones-watching-week
      0.141422
      17.0
    
    
      28
      new-york-city
      0.141422
      28.0
    
    
      29
      naturecultures
      0.141422
      27.0
    
    
      ...
      ...
      ...
      ...
    
    
      103
      ancient
      0.043366
      15.0
    
    
      104
      design
      0.043092
      15.0
    
    
      105
      architecture
      0.042018
      43.0
    
    
      106
      infrastructure
      0.041597
      20.0
    
    
      107
      animals
      0.041193
      165.0
    
    
      108
      escape-week
      0.039818
      26.0
    
    
      109
      books
      0.037367
      37.0
    
    
      110
      dinosaurs
      0.037367
      19.0
    
    
      111
      fleeting-wonders
      0.037367
      156.0
    
    
      112
      sculptures
      0.036208
      19.0
    
    
      113
      extra-mile
      0.035296
      16.0
    
    
      114
      churches
      0.033240
      32.0
    
    
      115
      crime-and-punishment
      0.033240
      52.0
    
    
      116
      cemeteries
      0.033240
      65.0
    
    
      117
      cats
      0.033240
      19.0
    
    
      118
      halloween
      0.033240
      16.0
    
    
      119
      holidays
      0.033240
      33.0
    
    
      120
      games
      0.031995
      20.0
    
    
      121
      caves
      0.031769
      23.0
    
    
      122
      food
      0.031127
      68.0
    
    
      123
      death
      0.028319
      43.0
    
    
      124
      disaster-areas
      0.027857
      25.0
    
    
      125
      features
      0.023676
      356.0
    
    
      126
      computers
      0.021552
      22.0
    
    
      127
      exploration
      0.017840
      30.0
    
    
      128
      garbage
      0.017840
      19.0
    
    
      129
      china
      0.016213
      19.0
    
    
      130
      airplanes
      0.013174
      22.0
    
    
      131
      cheat-week
      0.009833
      38.0
    
    
      132
      garbage-week
      0.006122
      31.0
    
  

133 rows × 3 columns



In [ ]:

	tag_id	tag_url	tagged_type	tagged_id	tagged_url
0	36	www.atlasobscura.com/categories/abandoned	Place	9982	www.atlasobscura.com/places/athens-olympic-spo...
1	2	www.atlasobscura.com/categories/panoramas	Place	1676	www.atlasobscura.com/places/velaslavasay-panorama
2	2	www.atlasobscura.com/categories/panoramas	Place	6431	www.atlasobscura.com/places/gettysburg-cyclorama
3	2	www.atlasobscura.com/categories/panoramas	Article	2311	www.atlasobscura.com/articles/rip-gettysburg-c...
4	258	www.atlasobscura.com/categories/bridges	Place	10134	www.atlasobscura.com/places/gimbel-s-bridge
5	2	www.atlasobscura.com/categories/panoramas	Place	6430	www.atlasobscura.com/places/borodino-panorama
6	2	www.atlasobscura.com/categories/panoramas	Place	6428	www.atlasobscura.com/places/panorama-mesdag
7	2	www.atlasobscura.com/categories/panoramas	Place	3688	www.atlasobscura.com/places/panorama-raclawice
8	3	www.atlasobscura.com/categories/disasters	Place	6343	www.atlasobscura.com/places/mars-bluff-crater
9	4	www.atlasobscura.com/categories/atom-bombs	Place	6343	www.atlasobscura.com/places/mars-bluff-crater

	tag_id	tag_url	tagged_type	tagged_id	tagged_url
3	2	panoramas	Article	2311	www.atlasobscura.com/articles/rip-gettysburg-c...
56	27	objects-of-intrigue	Article	2227	www.atlasobscura.com/articles/objects-of-intri...
57	27	objects-of-intrigue	Article	2268	www.atlasobscura.com/articles/objects-of-intri...
58	27	objects-of-intrigue	Article	2213	www.atlasobscura.com/articles/objects-of-intri...
62	27	objects-of-intrigue	Article	2216	www.atlasobscura.com/articles/objects-of-intri...

	100-wonders	19th-century	2016-election	30-rock	31-days-of-halloween	abandoned	abandoned-amusement-parks	abandoned-brooklyn	abandoned-cemetaries	abandoned-hospitals	...	world-s-smallest	world-s-tallest	world-war-ii	wunderkammer	wwi	wwii	yehlui-geological-park	yeti	zombies	zoos
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
56	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
57	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
58	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
62	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

	url	published	pageviews
0	jamaica-may-get-rid-of-queen-elizabeth-and-fin...	2016-04-15	3997
1	trippy-blacklight-posters-from-the-psychedelic...	2016-04-15	7042
2	leonardo-da-vincis-living-descendants-have-bee...	2016-04-15	12448
3	catapult-into-the-weekend-like-this-gopro-off-...	2016-04-15	4187
4	cat-rescued-after-4-days-stuck-on-insanely-tal...	2016-04-15	2721

	url	published	pageviews
0	www.atlasobscura.com/articles/jamaica-may-get-...	2016-04-15	3997
1	www.atlasobscura.com/articles/trippy-blackligh...	2016-04-15	7042
2	www.atlasobscura.com/articles/leonardo-da-vinc...	2016-04-15	12448
3	www.atlasobscura.com/articles/catapult-into-th...	2016-04-15	4187
4	www.atlasobscura.com/articles/cat-rescued-afte...	2016-04-15	2721

	pageviews
count	3446.000000
mean	7052.891759
std	23256.215270
min	1.000000
25%	1150.250000
50%	2571.500000
75%	5834.750000
max	621494.000000

	100-wonders	19th-century	2016-election	30-rock	31-days-of-halloween	abandoned	abandoned-amusement-parks	abandoned-brooklyn	abandoned-cemetaries	abandoned-hospitals	...	world-war-ii	wunderkammer	wwi	wwii	yehlui-geological-park	yeti	zombies	zoos	published	pageviews
tagged_url
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2015-08-01	651.0
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2015-06-09	3505.0
www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2014-05-12	840.0
www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2015-12-30	4037.0
www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	2016-01-07	1620.0

	0	1	2	3	4	5	6	7	8	9	...	722	723	724	725	726	727	728	729	730	total
10-little-known-beaches-to-explore-in-the-last-days-of-summer	2.0	419.0	203.0	19.0	4.0	6.0	2.0	7.0	4.0	35.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	940.0
10-of-the-greatest-overland-migrations-photos	468.0	368.0	658.0	325.0	138.0	40.0	33.0	77.0	63.0	21.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4826.0
10-places-12-year-old-me-would-love-to-live	106.0	762.0	271.0	132.0	209.0	96.0	41.0	15.0	9.0	9.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4621.0
10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton	2186.0	538.0	209.0	377.0	92.0	134.0	80.0	34.0	18.0	75.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4482.0
100-wonders-a-visit-with-a-frozen-dead-guy	928.0	272.0	231.0	87.0	96.0	40.0	16.0	11.0	7.0	7.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2032.0

	0	1	2	3	4	5	6	7	8	9	...	724	725	726	727	728	729	730	total	days_to_90p	published
index
www.atlasobscura.com/articles/10-little-known-beaches-to-explore-in-the-last-days-of-summer	2.0	419.0	203.0	19.0	4.0	6.0	2.0	7.0	4.0	35.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	940.0	189	2015-08-01
www.atlasobscura.com/articles/10-of-the-greatest-overland-migrations-photos	468.0	368.0	658.0	325.0	138.0	40.0	33.0	77.0	63.0	21.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4826.0	230	2015-06-09
www.atlasobscura.com/articles/10-places-12-year-old-me-would-love-to-live	106.0	762.0	271.0	132.0	209.0	96.0	41.0	15.0	9.0	9.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4621.0	634	2014-05-12
www.atlasobscura.com/articles/10-things-that-you-have-secretly-been-dying-to-know-about-the-world-of-hamilton	2186.0	538.0	209.0	377.0	92.0	134.0	80.0	34.0	18.0	75.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4482.0	19	2015-12-30
www.atlasobscura.com/articles/100-wonders-a-visit-with-a-frozen-dead-guy	928.0	272.0	231.0	87.0	96.0	40.0	16.0	11.0	7.0	7.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2032.0	45	2016-01-07

	days_to_90p	total	year_pub
days_to_90p	1.000000	-0.058821	-0.742601
total	-0.058821	1.000000	0.092965
year_pub	-0.742601	0.092965	1.000000

	100-wonders	19th-century	2016-election	30-rock	31-days-of-halloween	abandoned	abandoned-amusement-parks	abandoned-brooklyn	abandoned-cemetaries	abandoned-hospitals	...	world-s-smallest	world-s-tallest	world-war-ii	wunderkammer	wwi	wwii	yehlui-geological-park	yeti	zombies	zoos
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
56	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
57	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
58	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
62	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

	num_tagged
news_space	43
animals_features	38
news_animals	38
animals_news	38
features_animals	38
100-wonders_video	37
video_100-wonders	37
columns_features	35
features_columns	35
columns_map-monday	33

	tags	probabilities
109	features_time-week	0.436462
53	features_animals	0.400695
57	features_birds	0.368227
149	animals_animal-week	0.296609
111	features_underground-week	0.283282
116	features_watery-wonders	0.283282
150	animals_birds	0.270520
209	maps_exploration	0.261088
110	features_tunnels	0.261088
24	news_features	0.183567
223	naturecultures_columns	0.178019
190	art_fleeting-wonders	0.175223
42	news_sculptures	0.169818
48	news_trees	0.169818
26	news_fossils	0.169818
158	animals_found	0.163157
163	animals_video	0.155569
138	found_archaeology	0.155189
118	features_women	0.154375
168	fleeting-wonders_food	0.151925
101	features_sculptures	0.148187
69	features_fashion	0.147209
171	fleeting-wonders_sports	0.141600
145	found_science	0.129710
146	found_shipwrecks	0.129710
147	found_space	0.129710
148	found_war	0.129710
154	animals_features	0.129710
151	animals_cats	0.129710
153	animals_dogs	0.129710
...	...	...
50	news_volcanoes	0.058617
49	news_underwater	0.058617
13	news_architecture	0.058617
14	news_art	0.058617
15	news_australia	0.058617
16	news_birds	0.058617
17	news_books	0.058617
37	news_oceans	0.058617
39	news_politics	0.058617
20	news_crime-and-punishment	0.058617
29	news_insects	0.058617
36	news_nasa	0.058617
40	news_religion	0.058617
34	news_music	0.058617
41	news_science	0.058617
32	news_literature	0.058617
31	news_japan	0.058617
30	news_islands	0.058617
43	news_shipwrecks	0.058617
25	news_food	0.058617
44	news_snakes	0.058617
45	news_space	0.058617
66	features_crime-and-punishment	0.053248
105	features_sports	0.040151
9	news_airplanes	0.040049
46	news_sports	0.040049
27	news_garbage-week	0.040049
23	news_dogs	0.040049
18	news_churches	0.040049
143	found_maps	0.036773

	tags	probabilities	subtag	num_tagged
184	100-wonders_disaster-areas	0.099160	disaster-areas	6.0
23	100-wonders_disasters	0.129710	disasters	6.0
24	100-wonders_science	0.129710	science	5.0
25	100-wonders_video	0.129710	video	37.0
3	animals_animal-week	0.296609	animal-week	10.0
6	animals_birds	0.270520	birds	8.0
26	animals_cats	0.129710	cats	8.0
176	animals_columns	0.121526	columns	8.0
27	animals_dogs	0.129710	dogs	7.0
28	animals_features	0.129710	features	38.0
29	animals_fleeting-wonders	0.129710	fleeting-wonders	12.0
30	animals_food	0.129710	food	4.0
31	animals_fossils	0.129710	fossils	4.0
15	animals_found	0.163157	found	21.0
32	animals_list	0.129710	list	8.0
187	animals_naturecultures	0.085154	naturecultures	6.0
188	animals_news	0.082413	news	38.0
33	animals_oceans	0.129710	oceans	7.0
16	animals_video	0.155569	video	8.0
178	animals_video-wonders	0.116927	video-wonders	6.0
34	art_columns	0.129710	columns	6.0
191	art_features	0.077260	features	12.0
11	art_fleeting-wonders	0.175223	fleeting-wonders	4.0
35	art_libraries	0.129710	libraries	5.0
36	art_museums	0.129710	museums	4.0
37	art_museums-and-collections	0.129710	museums-and-collections	4.0
38	art_news	0.129710	news	11.0
39	art_sculptures	0.129710	sculptures	6.0
40	art_visual	0.129710	visual	16.0
41	columns_animals	0.129710	animals	8.0
...	...	...	...	...
212	news_snakes	0.058617	snakes	5.0
213	news_space	0.058617	space	43.0
224	news_sports	0.040049	sports	7.0
151	news_statues	0.129710	statues	5.0
14	news_trees	0.169818	trees	5.0
214	news_underwater	0.058617	underwater	7.0
215	news_volcanoes	0.058617	volcanoes	5.0
152	news_war	0.129710	war	4.0
153	news_water	0.129710	water	4.0
154	objects-of-intrigue_features	0.129710	features	7.0
155	objects-of-intrigue_space	0.129710	space	5.0
156	other-capitals-of-the-world_other-capitals-of-...	0.129710	other-capitals-of-the-world	12.0
216	video-wonders_animals	0.058617	animals	6.0
217	video-wonders_australia	0.058617	australia	4.0
163	video-wonders_sports	0.129710	sports	5.0
157	video_100-wonders	0.129710	100-wonders	37.0
158	video_animals	0.129710	animals	8.0
159	video_disaster-areas	0.129710	disaster-areas	5.0
160	video_disasters	0.129710	disasters	5.0
161	video_science	0.129710	science	5.0
162	video_sports	0.129710	sports	5.0
164	visual_abandoned	0.129710	abandoned	6.0
165	visual_architecture	0.129710	architecture	9.0
166	visual_art	0.129710	art	16.0
179	visual_features	0.110809	features	12.0
190	visual_list	0.077772	list	22.0
167	visual_photo-of-the-week	0.129710	photo-of-the-week	10.0
168	visual_photography	0.129710	photography	20.0
169	visual_soviet	0.129710	soviet	6.0
170	visual_space	0.129710	space	9.0

	tags	probabilities	subtag	num_tagged	pageviews	mean-PVs
75	features_linguistics	0.129710	linguistics	4.0	207623.0	51905.0
82	features_miracles-week	0.129710	miracles-week	8.0	390486.0	48810.0
63	features_computers	0.129710	computers	7.0	263227.0	37603.0
91	features_plants	0.129710	plants	5.0	145866.0	29173.0
66	features_film	0.129710	film	14.0	384072.0	27433.0
100	features_television	0.129710	television	8.0	200192.0	25024.0
191	art_features	0.077260	features	12.0	284906.0	23742.0
192	features_art	0.061562	art	12.0	284906.0	23742.0
74	features_language	0.129710	language	4.0	87118.0	21779.0
0	features_time-week	0.436462	time-week	10.0	171105.0	17110.0
101	features_video-games	0.129710	video-games	9.0	151971.0	16885.0
95	features_science-fiction	0.129710	science-fiction	4.0	66197.0	16549.0
106	features_wwii	0.129710	wwii	6.0	84399.0	14066.0
58	features_books	0.129710	books	8.0	107780.0	13472.0
56	features_architecture	0.129710	architecture	6.0	69250.0	11541.0
5	features_watery-wonders	0.283282	watery-wonders	4.0	43674.0	10918.0
87	features_naturecultures	0.129710	naturecultures	26.0	279450.0	10748.0
173	naturecultures_features	0.128812	features	26.0	279450.0	10748.0
85	features_murder	0.129710	murder	4.0	42335.0	10583.0
68	features_games	0.129710	games	8.0	84261.0	10532.0
59	features_cats	0.129710	cats	6.0	59547.0	9924.0
62	features_columns	0.129710	columns	35.0	339985.0	9713.0
46	columns_features	0.129710	features	35.0	339985.0	9713.0
81	features_military	0.129710	military	6.0	55619.0	9269.0
185	features_religion	0.098450	religion	5.0	44193.0	8838.0
177	features_space	0.119270	space	8.0	67853.0	8481.0
180	features_crime	0.107340	crime	6.0	49116.0	8186.0
4	features_underground-week	0.283282	underground-week	8.0	65102.0	8137.0
77	features_literature	0.129710	literature	9.0	68614.0	7623.0
102	features_visual	0.129710	visual	12.0	89421.0	7451.0
...	...	...	...	...	...	...
84	features_monsters	0.129710	monsters	6.0	35139.0	5856.0
105	features_witchcraft	0.129710	witchcraft	4.0	23395.0	5848.0
73	features_kickass-women	0.129710	kickass-women	5.0	29003.0	5800.0
21	features_fashion	0.147209	fashion	5.0	28595.0	5719.0
94	features_science	0.129710	science	15.0	82426.0	5495.0
64	features_dinosaurs	0.129710	dinosaurs	4.0	21218.0	5304.0
2	features_birds	0.368227	birds	9.0	47373.0	5263.0
88	features_new-york-city	0.129710	new-york-city	6.0	31498.0	5249.0
86	features_music	0.129710	music	12.0	61632.0	5136.0
80	features_medicine	0.129710	medicine	5.0	21993.0	4398.0
104	features_water	0.129710	water	7.0	30590.0	4370.0
69	features_garbage	0.129710	garbage	4.0	17274.0	4318.0
9	news_features	0.183567	features	6.0	25507.0	4251.0
89	features_news	0.129710	news	6.0	25507.0	4251.0
20	features_sculptures	0.148187	sculptures	5.0	20346.0	4069.0
99	features_technology	0.129710	technology	4.0	15892.0	3973.0
57	features_birdweek	0.129710	birdweek	9.0	35752.0	3972.0
98	features_statues	0.129710	statues	4.0	15627.0	3906.0
60	features_cheat-week	0.129710	cheat-week	8.0	31149.0	3893.0
103	features_war	0.129710	war	5.0	18591.0	3718.0
92	features_politics	0.129710	politics	14.0	50899.0	3635.0
219	features_sports	0.040151	sports	13.0	40004.0	3077.0
93	features_presidents	0.129710	presidents	4.0	12032.0	3008.0
7	features_tunnels	0.261088	tunnels	4.0	11691.0	2922.0
61	features_china	0.129710	china	9.0	17069.0	1896.0
71	features_halloween	0.129710	halloween	4.0	6578.0	1644.0
97	features_sounds	0.129710	sounds	4.0	5630.0	1407.0
96	features_snow	0.129710	snow	4.0	4647.0	1161.0
90	features_objects-of-intrigue	0.129710	objects-of-intrigue	7.0	NaN	NaN
154	objects-of-intrigue_features	0.129710	features	7.0	NaN	NaN

	pageviews	num_pubbed	year
pageviews	1.000000	0.926886	0.506975
num_pubbed	0.926886	1.000000	0.650054
year	0.506975	0.650054	1.000000

	7-day-PVs	num_pubbed
published
2015-08-01	662.0	1
2015-06-09	2107.0	1
2014-05-12	1632.0	1
2015-12-30	3650.0	1
2016-01-07	1681.0	1
2015-08-20	624.0	1
2015-12-17	2442.0	1
2015-09-15	749.0	1
2015-10-21	1590.0	1
2015-12-03	1483.0	1

	7-day-PVs	num_pubbed	year
7-day-PVs	1.000000	0.466825	0.158331
num_pubbed	0.466825	1.000000	0.380993
year	0.158331	0.380993	1.000000

	mean-PVs	mean-shares	num_tagged
Tag
maps	18568.0	545.0	60.0
naturecultures	14334.0	314.0	27.0
features	14071.0	356.0	356.0
other capitals of the world	13825.0	391.0	12.0
visual	12602.0	435.0	117.0
video wonders	12324.0	329.0	47.0
video	10834.0	231.0	102.0
list	8637.0	195.0	85.0
news	8146.0	225.0	461.0
extra mile	7989.0	248.0	16.0
found	7756.0	241.0	167.0
columns	7465.0	222.0	212.0
female explorers	6573.0	343.0	15.0
animals	6086.0	259.0	165.0
100 wonders	5687.0	90.0	44.0
objects of intrigue	5029.0	212.0	79.0
fleeting wonders	4452.0	115.0	156.0
art	4317.0	139.0	92.0
places you can no longer go	3948.0	38.0	44.0
morbid monday	1537.0	32.0	45.0

	tags	probabilities
33	news	0.189916
0	100-wonders	0.176529
1	31-days-of-halloween	0.146919
8	cemeteries	0.146576
51	war	0.146576
21	garbage-week	0.146576
10	churches	0.146576
52	wwii	0.146576
3	animals	0.146576
2	abandoned	0.146576
24	libraries	0.091183
11	columns	0.091183
47	underground-week	0.091183
22	holidays	0.091183
38	places-you-can-no-longer-go	0.090890
12	crime-and-punishment	0.090599
35	objects-of-intrigue	0.089517
39	politics	0.084188
6	birds	0.061686
9	cheat-week	0.059556
5	art	0.059297
36	obscura-day	0.056281
37	oceans	0.056281
7	books	0.056281
40	rites-and-rituals	0.056281
32	music	0.056281
43	society-adventures	0.056281
44	space	0.056281
48	video	0.056281
49	video-wonders	0.056281
50	visual	0.056281
41	ruins	0.056281
30	morbid-monday	0.056281
19	food	0.056281
25	list	0.056281
13	curious-fact-of-the-week	0.056281
14	death	0.056281
16	features	0.056281
20	found	0.056281
26	magic	0.056281
4	architecture	0.056281
28	maps	0.056281
29	medicine	0.056281
31	museums	0.051961
15	exploration	0.040709
27	map-monday	0.037582
42	science	0.033551
23	islands	0.033012
18	fleeting-wonders	0.033012
34	notes-from-the-field	0.029644
45	sports	0.029644
17	film	0.022731
46	transportation	0.020998

	100-wonders	19th-century	2016-election	30-rock	31-days-of-halloween	abandoned	abandoned-amusement-parks	abandoned-brooklyn	abandoned-cemetaries	abandoned-hospitals	...	world-s-smallest	world-s-tallest	world-war-ii	wunderkammer	wwi	wwii	yehlui-geological-park	yeti	zombies	zoos
3	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
56	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
57	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
58	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
62	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

	tags	coefficients
52	wwii	1.065043
28	maps	0.848714
7	books	0.836221
47	underground-week	0.575863
29	medicine	0.567487
2	abandoned	0.562231
9	cheat-week	0.517251
24	libraries	0.385578
30	morbid-monday	0.327623
4	architecture	0.303256
49	video-wonders	0.293608
25	list	0.234942
16	features	0.221080
11	columns	0.204949
31	museums	0.159610
37	oceans	0.134381
10	churches	0.129452
14	death	0.102822
41	ruins	0.055650
32	music	0.055465
21	garbage-week	0.013623
3	animals	0.002736
5	art	0.000338
19	food	-0.007075
50	visual	-0.013276
17	film	-0.021714
1	31-days-of-halloween	-0.023128
20	found	-0.043365
51	war	-0.084437
46	transportation	-0.102804
27	map-monday	-0.204296
22	holidays	-0.280851
26	magic	-0.340270
33	news	-0.403831
35	objects-of-intrigue	-0.430410
23	islands	-0.464792
12	crime-and-punishment	-0.488901
48	video	-0.520194
0	100-wonders	-0.537564
44	space	-0.541108
42	science	-0.557842
18	fleeting-wonders	-0.601600
8	cemeteries	-0.668934
39	politics	-0.709310
36	obscura-day	-0.759159
40	rites-and-rituals	-0.922080
6	birds	-0.941553
15	exploration	-0.959669
34	notes-from-the-field	-0.964113
43	society-adventures	-1.238194
45	sports	-1.278518
13	curious-fact-of-the-week	-1.470357
38	places-you-can-no-longer-go	-1.730003

	tags	probabilities
52	wwii	0.250000
8	cemeteries	0.250000
2	abandoned	0.250000
3	animals	0.250000
51	war	0.250000
21	garbage-week	0.250000
10	churches	0.250000
33	news	0.214286
0	100-wonders	0.214286
12	crime-and-punishment	0.178571
35	objects-of-intrigue	0.142857
42	science	0.107143
9	cheat-week	0.071429
5	art	0.071429
34	notes-from-the-field	0.071429
36	obscura-day	0.071429
37	oceans	0.071429
39	politics	0.071429
40	rites-and-rituals	0.071429
41	ruins	0.071429
43	society-adventures	0.071429
7	books	0.071429
44	space	0.071429
45	sports	0.071429
46	transportation	0.071429
47	underground-week	0.071429
48	video	0.071429
49	video-wonders	0.071429
50	visual	0.071429
4	architecture	0.071429
32	music	0.071429
31	museums	0.071429
30	morbid-monday	0.071429
20	found	0.071429
11	columns	0.071429
13	curious-fact-of-the-week	0.071429
14	death	0.071429
15	exploration	0.071429
16	features	0.071429
17	film	0.071429
18	fleeting-wonders	0.071429
19	food	0.071429
6	birds	0.071429
29	medicine	0.071429
22	holidays	0.071429
23	islands	0.071429
24	libraries	0.071429
25	list	0.071429
27	map-monday	0.071429
28	maps	0.071429
26	magic	0.071429
38	places-you-can-no-longer-go	0.035714
1	31-days-of-halloween	0.000000

	tags	probabilities
0	100-wonders	0.250698
27	map-monday	0.250698
29	medicine	0.250698
30	morbid-monday	0.250698
31	museums	0.250698
32	music	0.250698
33	news	0.250698
34	notes-from-the-field	0.250698
35	objects-of-intrigue	0.250698
36	obscura-day	0.250698
37	oceans	0.250698
38	places-you-can-no-longer-go	0.250698
39	politics	0.250698
40	rites-and-rituals	0.250698
41	ruins	0.250698
42	science	0.250698
43	society-adventures	0.250698
44	space	0.250698
45	sports	0.250698
46	transportation	0.250698
47	underground-week	0.250698
48	video	0.250698
49	video-wonders	0.250698
50	visual	0.250698
51	war	0.250698
28	maps	0.250698
26	magic	0.250698
1	31-days-of-halloween	0.250698
25	list	0.250698
2	abandoned	0.250698
3	animals	0.250698
4	architecture	0.250698
5	art	0.250698
6	birds	0.250698
7	books	0.250698
8	cemeteries	0.250698
9	cheat-week	0.250698
10	churches	0.250698
11	columns	0.250698
12	crime-and-punishment	0.250698
13	curious-fact-of-the-week	0.250698
14	death	0.250698
15	exploration	0.250698
16	features	0.250698
17	film	0.250698
18	fleeting-wonders	0.250698
19	food	0.250698
20	found	0.250698
21	garbage-week	0.250698
22	holidays	0.250698
23	islands	0.250698
24	libraries	0.250698
52	wwii	0.250698

	tags	probabilities
130	women	0.371218
3	aircraft	0.330070
125	visual	0.313056
110	sports	0.303625
71	mummies	0.288925
99	saints	0.272294
115	time-week	0.263755
98	ruins	0.241347
109	space	0.235258
100	science	0.235128
70	mountains	0.233971
103	ships	0.215629
119	tunnels	0.211667
118	trees	0.207395
111	statues	0.204833
116	trains	0.204833
60	magic	0.197443
107	society-adventures	0.191618
0	100-wonders	0.188841
57	libraries	0.182539
112	subterranean	0.182539
132	wwii	0.166283
131	world-s-fair	0.152512
105	skeletons	0.147225
53	insects	0.145729
61	map-monday	0.142249
77	nasa	0.141422
82	no-ones-watching-week	0.141422
80	new-york-city	0.141422
79	naturecultures	0.141422
...	...	...
5	ancient	0.043366
28	design	0.043092
9	architecture	0.042018
52	infrastructure	0.041597
6	animals	0.041193
32	escape-week	0.039818
13	books	0.037367
29	dinosaurs	0.037367
41	fleeting-wonders	0.037367
101	sculptures	0.036208
35	extra-mile	0.035296
20	churches	0.033240
25	crime-and-punishment	0.033240
16	cemeteries	0.033240
14	cats	0.033240
49	halloween	0.033240
50	holidays	0.033240
44	games	0.031995
15	caves	0.031769
42	food	0.031127
27	death	0.028319
30	disaster-areas	0.027857
37	features	0.023676
23	computers	0.021552
34	exploration	0.017840
45	garbage	0.017840
18	china	0.016213
4	airplanes	0.013174
17	cheat-week	0.009833
46	garbage-week	0.006122